[svn-inject] Installing original source of mod-proxy-html

author Emmanuel Lacour <elacour@home-dn.net>

Sat, 13 Oct 2007 14:27:09 +0000 (14:27 +0000)

committer Emmanuel Lacour <elacour@home-dn.net>

Sat, 13 Oct 2007 14:27:09 +0000 (14:27 +0000)
diff --git a/config.html b/config.html
new file mode 100644
index 0000000..1317a3e
--- /dev/null
+++ b/config.html
@@ -0,0 +1,154 @@
+<?xml version="1.0" encoding="utf-8"?>
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
+<html lang="en"><head>
+<title>mod_proxy_html</title>
+<style type="text/css">
+@import url(/index.css) ;
+</style>
+</head><body>
+<div id="apache">
+<h1>mod_proxy_html: Configuration</h1>
+<p><a href="./">mod_proxy_html</a> Version 2.4 (Sept-Nov 2004)</p>
+<h2>Configuration Directives</h2>
+<p>The following can be used anywhere in an <strong>httpd.conf</strong>
+or included configuration file.</p>
+<dl>
+<dt>ProxyHTMLURLMap</dt>
+<dd>
+<p>Syntax:
+<code>ProxyHTMLURLMap  from-pattern    to-pattern flags</code></p>
+<p>This is the key directive for rewriting HTML links.  When parsing a document,
+whenever a link target matches <samp>from-pattern</samp>, the matching
+portion will be rewritten to <samp>to-pattern</samp>.</p>
+<p>Starting at version 2.0, this supports a wider range of pattern-matching
+and substitutions, including regular expression search and replace,
+controlled by the optional third <code>flags</code> argument.
+</p>
+<h3>Flags for ProxyHTMLURLMap</h3>
+<p>Flags are case-sensitive.</p>
+<dl class="table">
+<dt>h</dt>
+<dd><p>Ignore HTML links (pass through unchanged)</p></dd>
+<dt>e</dt>
+<dd><p>Ignore scripting events (pass through unchanged)</p></dd>
+<dt>c</dt>
+<dd><p>Pass embedded script and style sections through untouched.</p></dd>
+<dt>L</dt>
+<dd><p>Last-match.  If this rule matches, no more rules are applied
+(note that this happens automatically for HTML links).</p></dd>
+<dt>R</dt>
+<dd><p>Use Regular Expression matching-and-replace.  <code>from-pattern</code>
+is a regexp, and <code>to-pattern</code> a replacement string that may be
+based on the regexp.  Regexp memory is supported: you can use brackets ()
+in the <code>from-pattern</code> and retrieve the matches with $1 to $9
+in the <code>to-pattern</code>.</p>
+<p>If R is not set, it will use string-literal search-and-replace, as in
+versions 1.x.  Logic is <em>starts-with</em> in HTML links, but
+<em>contains</em> in scripting events and embedded script and style sections.
+</p>
+</dd>
+<dt>x</dt>
+<dd><p>Use POSIX extended Regular Expressions.  Only applicable with R.</p></dd>
+<dt>i</dt>
+<dd><p>Case-insensitive matching.  Only applicable with R.</p></dd>
+<dt>n</dt>
+<dd><p>Disable regexp memory (for speed).  Only applicable with R.</p></dd>
+<dt>s</dt>
+<dd><p>Line-based regexp matching.  Only applicable with R.</p></dd>
+<dt>^</dt>
+<dd><p>Match at start only.  This applies only to string matching
+(not regexps) and is irrelevant to HTML links.</p></dd>
+<dt>$</dt>
+<dd><p>Match at end only.  This applies only to string matching
+(not regexps) and is irrelevant to HTML links.</p></dd>
+</dl>
+<!-- <h4>Examples</h4> -->
+</dd>
+<dt>ProxyHTMLDoctype</dt>
+<dd>
+<p>Syntax: <code>ProxyHTMLDoctype HTML|XHTML [Legacy]</code></p>
+<p>Alternative Syntax: <code>ProxyHTMLDocType fpi [SGML|XML]</code></p>
+<p>In the first form, documents will be declared as HTML 4.01 or XHTML 1.0
+according to the option selected.  This option also determines whether
+HTML or XHTML syntax is used for output.   Note that the format of the
+documents coming from the backend server is immaterial: the parser will
+deal with it automatically.  If the optional second argument is set to
+"Legacy", documents will be declared "Transitional", an option that may
+be necessary if you are proxying pre-1998 content or working with defective
+authoring/publishing tools.</p>
+<p>In the second form, it will insert your own <abbr
+title="Formal Public Identifier">FPI</abbr>.  The optional second
+argument determines whether SGML/HTML or XML/XHTML syntax will be used.</p>
+<p>Starting at version 2.0, the default is changed to omitting any FPI,
+on the grounds that no FPI is better than a bogus one.  If your backend
+generates decent HTML or XHTML, set it accordingly.</p>
+</dd>
+<dt>ProxyHTMLFixups</dt>
+<dd>
+<p>Syntax: <code>ProxyHTMLFixups [lowercase] [dospath] [reset]</code></p>
+<p>This directive takes one to three arguments as follows:</p>
+<ul>
+<li><code>lowercase</code> URLs are rewritten to lowercase.</li>
+<li><code>dospath</code> Backslashes in URLs are rewritten to forward slashes.</li>
+<li><code>reset</code> Unset any options set at a higher level in the configuration.</li>
+</ul>
+<p>Take care when using these.  The fixes will correct certain authoring
+mistakes, but risk also erroneously fixing links that were correct to start with.
+Only use them if you know you have a broken backend server.</p> 
+</dd>
+<dt>ProxyHTMLMeta</dt>
+<dd><p>Syntax: <code>ProxyHTMLMeta [On|Off]</code></p>
+<p>Parses <code>&lt;meta http-equiv ...&gt;</code> elements to real HTTP
+headers.</p>
+</dd>
+<dt>ProxyHTMLExtended</dt>
+<dd><p>Syntax: <code>ProxyHTMLExtended [On|Off]</code></p>
+<p>Set to <code>Off</code>, this gives the same behaviour as 1.x versions
+of mod_proxy_html.  HTML links are rewritten according to the ProxyHTMLURLMap
+directives, but links appearing in Javascript and CSS are ignored.</p>
+<p>Set to <code>On</code>, all scripting events and embedded scripts or
+stylesheets are also processed by the ProxyHTMLURLMap rules, according to
+the flags set for each rule.  Since this requires more parsing, performance
+will be best if you only enable it when strictly necessary.</p>
+</dd>
+<dt>ProxyHTMLStripComments</dt>
+<dd><p>Syntax: <code>ProxyHTMLStripComments [On|Off]</code></p>
+<p>This directive will cause mod_proxy_html to strip HTML comments.
+Note that this will also kill off any scripts or styles embedded in
+comments (a bogosity introduced in 1995/6 with Netscape 2 for the
+benefit of then-older browsers, but still in use today).
+It may also interfere with comment-based processors such as SSI or ESI:
+be sure to run any of those <em>before</em> mod_proxy_html in the
+filter chain if stripping comments!</p>
+</dd>
+<dt>ProxyHTMLLogVerbose</dt>
+<dd><p>Syntax: <code>ProxyHTMLLogVerbose [On|Off]</code></p>
+<p>Turns on verbose logging.  This causes mod_proxy_html to make
+error log entries (at <code>LogLevel Info</code>) about charset
+detection and about all meta substitutions and rewrites made.
+When Off, only errors and warnings (if any) are logged.</p>
+</dd>
+<dt>ProxyHTMLBufSize</dt>
+<dd><p>Syntax: <code>ProxyHTMLBufSize nnnn</code></p>
+<p>Set the buffer size increment for buffering inline stylesheets and scripts.</p>
+<p>In order to parse non-HTML content (stylesheets and scripts), mod_proxy_html
+has to read the entire script or stylesheet into a buffer.  This buffer will
+be expanded as necessary to hold the largest script or stylesheet in a page,
+in increments of [nnnn] as set by this directive.</p>
+<p>The default is 8192, and will work well for almost all pages.  However,
+if you know you're proxying a lot of pages containing stylesheets and/or
+scripts bigger than 8K, it will be more efficient to set a larger buffer
+size and avoid the need to resize the buffer dynamically during a request.
+</p>
+</dd>
+</dl>
+</div>
+<div id="navbar"><a class="internal" href="./" title="Up">Up</a>
+*
+<a class="internal" href="/" title="WebThing Apache Centre">Home</a>
+*
+<a class="internal" href="/contact.html" title="Contact WebThing">Contact</a>
+*
+<a class="external" href="http://www.webthing.com/" title="WebThing Ltd">Web&#222;ing</a>
+*
+<a class="external" href="http://www.apache.org/" title="Apache Software Foundation">Apache</a></div></body></html>
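Note on the ProxyHTMLBufSize text in config.html above: the growth rule it describes (expand the buffer in steps of nnnn until the largest script or stylesheet fits) can be sketched in a few lines of C. This is an illustration only, not code taken from the module; the helper name grown_capacity and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of mod_proxy_html): return the smallest
 * capacity, reached by adding whole increments to `avail`, that leaves
 * room for `len` more bytes after the `offset` bytes already in use. */
static size_t grown_capacity(size_t avail, size_t offset, size_t len,
                             size_t increment) {
  assert(increment > 0);     /* a zero increment would never terminate */
  assert(offset <= avail);   /* cannot have used more than is allocated */
  while (len > avail - offset)
    avail += increment;
  return avail;
}
```

Starting from an empty buffer, a 20000-byte stylesheet needs three increments of the default 8192 (24576 bytes in total), whereas an increment of 32768 covers it in one step. That appears to be the saving the documentation has in mind when it suggests a larger value for script-heavy pages.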
diff --git a/guide.html b/guide.html
new file mode 100644
index 0000000..50ffee8
--- /dev/null
+++ b/guide.html
@@ -0,0 +1,202 @@
+<?xml version="1.0" encoding="utf-8"?>
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
+<html lang="en"><head>
+<title>Technical guide: mod_proxy_html</title>
+<style type="text/css">
+@import url(/index.css) ;
+</style>
+</head><body>
+<div id="apache">
+<h1>mod_proxy_html: Technical Guide</h1>
+<p><a href="./">mod_proxy_html</a> Version 2.4 (Sept-Nov 2004).</p>
+<h2>Contents</h2>
+<ul id="toc">
+<li><a href="#url">URL Rewriting</a>
+<ul>
+<li><a href="#html">HTML Links</a></li>
+<li><a href="#event">Scripting Events</a></li>
+<li><a href="#cdata">Embedded Scripts and Stylesheets</a></li>
+</ul>
+</li>
+<li><a href="#output">Output Transformation</a>
+<ul>
+<li><a href="#fpi">FPI (Doctype)</a></li>
+<li><a href="#ml">HTML vs XHTML</a></li>
+<li><a href="#charset">Character Encoding</a></li>
+</ul>
+</li>
+<li><a href="#meta">meta http-equiv support</a></li>
+<li><a href="#misc">Other Fixups</a></li>
+<li><a href="#debug">Debugging your Configuration</a></li>
+<li><a href="#browser">Workarounds for Browser Bugs</a></li>
+</ul>
+<h2 id="url">URL Rewriting</h2>
+<p>Rewriting URLs into a proxy's address space is of course the primary
+purpose of this module.  From Version 2.0, this capability has been
+extended from rewriting HTML URLs to processing scripts and stylesheets
+that <em>may</em> contain URLs.</p>
+<p>Because the module doesn't contain parsers for javascript or CSS, this
+additional processing means we have had to introduce some heuristic parsing.
+What that means is that the parser cannot automatically distinguish between
+a URL that should be replaced and one that merely appears as text.  It's
+up to you to match the right things!  To help you do this, we have introduced
+some new features:</p>
+<ol>
+<li>The <code>ProxyHTMLExtended</code> directive.  The extended processing
+will only be activated if this is On.  The default is Off, which gives you
+the old behaviour.</li>
+<li>Regular Expression match-and-replace.  This can be used anywhere,
+but is most useful where context information can help distinguish URLs
+that should be replaced and avoid false positives.  For example,
+to rewrite URLs of CSS @import, we might define a rule<br />
+<code>ProxyHTMLURLMap  url\(http://internal.example.com([^\)]*)\)      url(http://proxy.example.com$1) Rihe</code><br />
+This explicitly rewrites from one servername to another, and uses regexp
+memory to match a path and append it unchanged in $1, while using the
+<code>url(...)</code> context to reduce the danger of a match that shouldn't
+be rewritten.  The <b>R</b> flag invokes regexp processing for this rule;
+<b>i</b> makes the match case-insensitive; while <b>h</b> and <b>e</b>
+save processing cycles by preventing the match being applied to HTML links
+and scripting events, where it is clearly irrelevant.</li>
+</ol>
+<h3 id="html">HTML Links</h3>
+<p>HTML links are those attributes defined by the HTML 4 and XHTML 1
+DTDs as of type <strong>%URI</strong>.  For example, the <strong>href</strong>
+attribute of the <strong>a</strong> element.  For a full list, see the
+declaration of <code>linked_elts</code> in <code>pstartElement</code>.
+Rules are applicable provided the <b>h</b> flag is not set.</p>
+<p>An HTML link always contains exactly one URL.  So whenever mod_proxy_html
+finds a matching <code>ProxyHTMLURLMap</code> rule, it will apply the
+transformation once and stop processing the attribute.</p>
+<h3 id="event">Scripting Events</h3>
+<p>Scripting events are the contents of event attributes as defined in the
+HTML4 and XHTML1 DTDs; for example <code>onclick</code>.  For a full list,
+see the declaration of <code>events</code> in <code>pstartElement</code>.
+Rules are applicable provided the <b>e</b> flag is not set.</p>
+<p>A scripting event may contain more than one URL, and will contain other
+text.  So when <code>ProxyHTMLExtended</code> is On, all applicable rules
+will be applied in order until and unless a rule with the <b>L</b> flag
+matches.  A rule may match more than once, provided the matches do not
+overlap, so a URL/pattern that appears more than once is rewritten
+every time it matches.</p>
+<h3 id="cdata">Embedded Scripts and Stylesheets</h3>
+<p>Embedded scripts and stylesheets are the contents of
+<code>&lt;script&gt;</code> and <code>&lt;style&gt;</code> elements.
+Rules are applicable provided the <b>c</b> flag is not set.</p>
+<p>A script or stylesheet may contain more than one URL, and will contain other
+text.  So when <code>ProxyHTMLExtended</code> is On, all applicable rules
+will be applied in order until and unless a rule with the <b>L</b> flag
+matches.  A rule may match more than once, provided the matches do not
+overlap, so a URL/pattern that appears more than once is rewritten
+every time it matches.</p>
+<h2 id="output">Output Transformation</h2>
+<p>mod_proxy_html uses a SAX parser.  This means that the input stream
+- and hence the output generated - will be normalised in various ways,
+even where nothing is actually rewritten.  To an HTML or XML parser,
+the document is not changed by normalisation, except as noted below.
+Exceptions to this may arise where the input stream is malformed, when
+the output of mod_proxy_html may be undefined.  These should of course
+be fixed at the backend: if mod_proxy_html doesn't work as expected,
+then neither will browsers in real life, except by coincidence.</p>
+<h3 id="fpi">FPI (Doctype)</h3>
+<p>Strictly speaking, HTML and XHTML documents are required to have a
+Formal Public Identifier (FPI), also known as a Document Type Declaration.
+This references a Document Type Definition (DTD) which defines the grammar/
+syntax to which the contents of the document must conform.</p>
+<p>The parser in mod_proxy_html loses any FPI in the input document, but
+gives you the option to insert one.  You may select either HTML or XHTML
+(see below), and if your backend is sloppy you may also want to use the
+"Legacy" keyword to make it declare documents "Transitional".  You may
+also declare a custom DTD, or (if your backend is seriously screwed
+so no DTD would be appropriate) omit it altogether.</p>
+<h3 id="ml">HTML vs XHTML</h3>
+<p>The differences between HTML 4.01 and XHTML 1.0 are essentially negligible,
+and mod_proxy_html can transform between the two.  You can safely select
+either, regardless of what the backend generates, and mod_proxy_html will
+apply the appropriate rules in generating output.  HTML saves a few bytes.</p>
+<p>If you declare a custom DTD, you should specify whether to generate
+HTML or XHTML syntax in the output.  This affects empty elements:
+HTML <b>&lt;br&gt;</b> vs XHTML <b>&lt;br /&gt;</b>.</p>
+<h3 id="charset">Character Encoding</h3>
+<p>The parser uses <strong>UTF-8</strong> (Unicode) internally, and
+mod_proxy_html <em>always</em> generates output as UTF-8.  This is
+supported by all general-purpose web software, and supports more
+character sets and languages than any other charset.</p>
+<p>The character encoding should be declared in HTTP: for example<br />
+<code>Content-Type: text/html; charset=latin1</code><br />
+mod_proxy_html has always supported this in its input, and ensured
+this happens in output.  But prior to version 2, it did not fully
+support detecting (sniffing) the charset when a backend fails to
+set the HTTP Header.</p>
+<p>From version 2.0, mod_proxy_html will detect the encoding of its input
+as follows:</p>
+<ol>
+<li>The HTTP headers, where available, always take precedence over other
+information.</li>
+<li>If the first 2-4 bytes are an XML Byte Order Mark (BOM), this is used.</li>
+<li>If the document starts with an XML declaration
+<code>&lt;?xml .... ?&gt;</code>, this determines encoding by XML rules.</li>
+<li>If the document contains the HTML hack
+<code>&lt;meta http-equiv="Content-Type" ...&gt;</code>, any charset declared
+here is used.</li>
+<li>In the absence of any of the above indications, the HTML-over-HTTP default
+encoding <b>ISO-8859-1</b> is assumed.</li>
+<li>The parser is set to ignore invalid characters, so a malformed input
+stream will generate glitches (unexpected characters) rather than risk
+aborting a parse altogether.</li>
+</ol>
+<h2 id="meta">meta http-equiv support</h2>
+<p>The HTML <code>meta</code> element includes a form
+<code>&lt;meta http-equiv="Some-Header" content="some-value"&gt;</code>
+which should notionally be converted to a real HTTP header by the webserver.
+In practice, it is more commonly supported in browsers than servers, and
+is common in constructs such as ClientPull (aka "meta refresh").
+The <code>ProxyHTMLMeta</code> directive supports the server generating
+real HTTP headers from these.  However, it does not strip them from the
+HTML (except for Content-Type, which is removed in case it contains
+conflicting charset information).</p>
+<h2 id="misc">Other Fixups</h2>
+<p>For additional minor functions of mod_proxy_html, please see the
+<code>ProxyHTMLFixups</code> and <code>ProxyHTMLStripComments</code>
+directives in the <a href="config.html">Configuration Guide</a>.</p>
+<h2 id="debug">Debugging your Configuration</h2>
+<p>From Version 2.1, mod_proxy_html supports a <code>ProxyHTMLLogVerbose</code>
+directive, to enable verbose logging at <code>LogLevel Info</code>.  This
+is designed to help with setting up your proxy configuration and
+diagnosing unexpected behaviour; it is not recommended for normal
+operation, and can be disabled altogether at compile time for extra
+performance (see the top of the source).</p>
+<p>When verbose logging is enabled, the following messages will be logged:</p>
+<ol>
+<li>In <strong>Charset Detection</strong>, it will report what charset is
+detected and how (HTTP rules, XML rules, or HTML rules).  Note that,
+regardless of verbose logging, an error or warning will be logged if an
+unsupported charset is detected or if no information can be found.</li>
+<li>When <code>ProxyHTMLMeta</code> is enabled, it logs each header/value
+pair processed.</li>
+<li>Whenever a <code>ProxyHTMLURLMap</code> rule matches and causes a
+rewrite, it is logged.  The message contains abbreviated context information:
+<strong>H</strong> denotes an HTML link matched; <strong>E</strong>
+denotes a match in a scripting event, <strong>C</strong> denotes a match
+in an inline script or stylesheet.  When the match is a regexp
+find-and-replace, it is also marked as <strong>RX</strong>.</li>
+</ol>
+<h2 id="browser">Workarounds for Browser Bugs</h2>
+<p>Because mod_proxy_html unsets the Content-Length header, it risks
+losing the performance advantage of HTTP Keep-Alive.  It therefore sets
+up HTTP Chunked Encoding when responding to HTTP/1.1 requests.  This
+enables keep-alive again for HTTP/1.1 agents.</p>
+<p>Unfortunately some buggy agents will send an HTTP/1.1 request but
+choke on an HTTP/1.1 response.  Typically you will see numbers before
+and after, and possibly in the middle of, a page.  To work around this, set the
+<code>force-response-1.0</code> environment variable in httpd.conf.
+For example,<br /><code>BrowserMatch MSIE force-response-1.0</code></p>
+</div>
+<div id="navbar"><a class="internal" href="./" title="Up">Up</a>
+*
+<a class="internal" href="/" title="WebThing Apache Centre">Home</a>
+*
+<a class="internal" href="/contact.html" title="Contact WebThing">Contact</a>
+*
+<a class="external" href="http://www.webthing.com/" title="WebThing Ltd">Web&#222;ing</a>
+*
+<a class="external" href="http://www.apache.org/" title="Apache Software Foundation">Apache</a></div></body></html>
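Note on the Character Encoding section of guide.html above: the precedence it lists can be sketched as a chain of checks. The C below is a simplified illustration under stated assumptions, not the module's implementation. The function name sniff_charset is hypothetical, it covers only the HTTP header, the 2- and 3-byte Byte Order Marks, and the ISO-8859-1 default, and it deliberately leaves out the XML declaration, the meta http-equiv check, and 4-byte BOMs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of mod_proxy_html).
 * http_charset: charset from the Content-Type header, or NULL/"" if absent.
 * body/len:     the first bytes of the response body. */
static const char* sniff_charset(const char* http_charset,
                                 const unsigned char* body, size_t len) {
  assert(body != NULL || len == 0);

  /* 1. The HTTP header, where available, always takes precedence. */
  if (http_charset != NULL && http_charset[0] != '\0')
    return http_charset;

  /* 2. Byte Order Mark (4-byte UTF-32 marks are not handled here). */
  if (len >= 3 && memcmp(body, "\xEF\xBB\xBF", 3) == 0)
    return "UTF-8";
  if (len >= 2 && memcmp(body, "\xFE\xFF", 2) == 0)
    return "UTF-16BE";
  if (len >= 2 && memcmp(body, "\xFF\xFE", 2) == 0)
    return "UTF-16LE";

  /* 3./4. XML declaration and <meta http-equiv> checks omitted in this sketch. */

  /* 5. HTML-over-HTTP default. */
  return "ISO-8859-1";
}
```

The point of the ordering is that a later, weaker hint never overrides an earlier one: a body that starts with a UTF-8 BOM is still treated as latin1 if the HTTP header says so.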
diff --git a/mod_proxy_html.c b/mod_proxy_html.c
new file mode 100644
index 0000000..dfdbf60
--- /dev/null
+++ b/mod_proxy_html.c
@@ -0,0 +1,1041 @@
+/********************************************************************
+        Copyright (c) 2003-4, WebThing Ltd
+        Author: Nick Kew <nick@webthing.com>
+
+This program is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; either version 2 of the License, or
+(at your option) any later version.
+     
+This program is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with this program; if not, write to the Free Software
+Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+      
+*********************************************************************/
+
+
+/********************************************************************
+       Note to Users
+ 
+       You are requested to register as a user, at
+       http://apache.webthing.com/registration.html
+ 
+       This entitles you to support from the developer.
+       I'm unlikely to reply to help/support requests from
+       non-registered users, unless you're paying and/or offering
+       constructive feedback such as bug reports or sensible
+       suggestions for further development.
+ 
+       It also makes a small contribution to the effort
+       that's gone into developing this work.
+*********************************************************************/
+
+/* End of Notices */
+
+
+
+
+/*     GO_FASTER
+
+       You can #define GO_FASTER to disable informational logging.
+       This disables the ProxyHTMLLogVerbose option altogether.
+
+       Default is to leave it undefined, and enable verbose logging
+       as a configuration option.  Binaries are supplied with verbose
+       logging enabled.
+*/
+
+#ifdef GO_FASTER
+#define VERBOSE(x)
+#else
+#define VERBOSE(x) if ( verbose ) x
+#endif
+
+#define VERSION_STRING "proxy_html/2.4"
+
+#include <ctype.h>
+
+/* libxml */
+#include <libxml/HTMLparser.h>
+
+/* apache */
+#include <http_protocol.h>
+#include <http_config.h>
+#include <http_log.h>
+#include <apr_strings.h>
+
+module AP_MODULE_DECLARE_DATA proxy_html_module ;
+
+#define M_HTML         0x01
+#define M_EVENTS       0x02
+#define M_CDATA                0x04
+#define M_REGEX                0x08
+#define M_ATSTART      0x10
+#define M_ATEND                0x20
+#define M_LAST         0x40
+
+typedef struct {
+  unsigned int start ;
+  unsigned int end ;
+} meta ;
+typedef struct urlmap {
+  struct urlmap* next ;
+  unsigned int flags ;
+  union {
+    const char* c ;
+    regex_t* r ;
+  } from ;
+  const char* to ;
+} urlmap ;
+typedef struct {
+  urlmap* map ;
+  const char* doctype ;
+  const char* etag ;
+  unsigned int flags ;
+  int extfix ;
+  int metafix ;
+  int strip_comments ;
+#ifndef GO_FASTER
+  int verbose ;
+#endif
+  size_t bufsz ;
+} proxy_html_conf ;
+typedef struct {
+  htmlSAXHandlerPtr sax ;
+  ap_filter_t* f ;
+  proxy_html_conf* cfg ;
+  htmlParserCtxtPtr parser ;
+  apr_bucket_brigade* bb ;
+  char* buf ;
+  size_t offset ;
+  size_t avail ;
+} saxctxt ;
+
+static int is_empty_elt(const char* name) {
+  const char** p ;
+  static const char* empty_elts[] = {
+    "br" ,
+    "link" ,
+    "img" ,
+    "hr" ,
+    "input" ,
+    "meta" ,
+    "base" ,
+    "area" ,
+    "param" ,
+    "col" ,
+    "frame" ,
+    "isindex" ,
+    "basefont" ,
+    NULL
+  } ;
+  for ( p = empty_elts ; *p ; ++p )
+    if ( !strcmp( *p, name) )
+      return 1 ;
+  return 0 ;
+}
+
+typedef struct {
+       const char* name ;
+       const char** attrs ;
+} elt_t ;
+
+#define NORM_LC 0x1
+#define NORM_MSSLASH 0x2
+#define NORM_RESET 0x4
+
+typedef enum { ATTR_IGNORE, ATTR_URI, ATTR_EVENT } rewrite_t ;
+
+static void normalise(unsigned int flags, char* str) {
+  char* p ;    /* same type as str, so strchr() below needs no cast */
+  if ( flags & NORM_LC )
+    for ( p = str ; *p ; ++p )
+      if ( isupper((unsigned char)*p) )
+       *p = tolower((unsigned char)*p) ;
+
+  if ( flags & NORM_MSSLASH )
+    for ( p = strchr(str, '\\') ; p ; p = strchr(p+1, '\\') )
+      *p = '/' ;
+
+}
+
+#define FLUSH ap_fwrite(ctx->f->next, ctx->bb, (chars+begin), (i-begin)) ; begin = i+1
+static void pcharacters(void* ctxt, const xmlChar *chars, int length) {
+  saxctxt* ctx = (saxctxt*) ctxt ;
+  int i ;
+  int begin ;
+  for ( begin=i=0; i<length; i++ ) {
+    switch (chars[i]) {
+      case '&' : FLUSH ; ap_fputs(ctx->f->next, ctx->bb, "&amp;") ; break ;
+      case '<' : FLUSH ; ap_fputs(ctx->f->next, ctx->bb, "&lt;") ; break ;
+      case '>' : FLUSH ; ap_fputs(ctx->f->next, ctx->bb, "&gt;") ; break ;
+      case '"' : FLUSH ; ap_fputs(ctx->f->next, ctx->bb, "&quot;") ; break ;
+      default : break ;
+    }
+  }
+  FLUSH ;
+}
+static void preserve(saxctxt* ctx, const size_t len) {
+  char* newbuf ;
+  if ( len <= ( ctx->avail - ctx->offset ) )
+    return ;
+  else while ( len > ( ctx->avail - ctx->offset ) )
+    ctx->avail += ctx->cfg->bufsz ;
+
+  newbuf = realloc(ctx->buf, ctx->avail) ;
+  if ( newbuf != ctx->buf ) {
+    if ( ctx->buf )
+       apr_pool_cleanup_kill(ctx->f->r->pool, ctx->buf, (void*)free) ;
+    apr_pool_cleanup_register(ctx->f->r->pool, newbuf,
+       (void*)free, apr_pool_cleanup_null);
+    ctx->buf = newbuf ;
+  }
+}
+static void pappend(saxctxt* ctx, const char* buf, const size_t len) {
+  preserve(ctx, len) ;
+  memcpy(ctx->buf+ctx->offset, buf, len) ;
+  ctx->offset += len ;
+}
+static void dump_content(saxctxt* ctx) {
+  urlmap* m ;
+  char* found ;
+  size_t s_from, s_to ;
+  size_t match ;
+  char c = 0 ;
+  int nmatch ;
+  regmatch_t pmatch[10] ;
+  char* subs ;
+  size_t len, offs ;
+#ifndef GO_FASTER
+  int verbose = ctx->cfg->verbose ;
+#endif
+
+  pappend(ctx, &c, 1) ;        /* append null byte */
+       /* parse the text for URLs */
+  for ( m = ctx->cfg->map ; m ; m = m->next ) {
+    if ( ! ( m->flags & M_CDATA ) )
+       continue ;
+    if ( m->flags & M_REGEX ) {
+      nmatch = 10 ;
+      offs = 0 ;
+      while ( ! ap_regexec(m->from.r, ctx->buf+offs, nmatch, pmatch, 0) ) {
+       match = pmatch[0].rm_so ;
+       s_from = pmatch[0].rm_eo - match ;
+       subs = ap_pregsub(ctx->f->r->pool, m->to, ctx->buf+offs,
+               nmatch, pmatch) ;
+       s_to = strlen(subs) ;
+       len = strlen(ctx->buf) ;
+       offs += match ;
+       VERBOSE( {
+         const char* f = apr_pstrndup(ctx->f->r->pool,
+               ctx->buf + offs , s_from ) ;
+         ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, ctx->f->r,
+               "C/RX: match at %s, substituting %s", f, subs) ;
+       } )
+       if ( s_to > s_from) {
+         preserve(ctx, s_to - s_from) ;
+         memmove(ctx->buf+offs+s_to, ctx->buf+offs+s_from,
+               len + 1 - s_from - offs) ;
+         memcpy(ctx->buf+offs, subs, s_to) ;
+       } else {
+         memcpy(ctx->buf + offs, subs, s_to) ;
+         memmove(ctx->buf+offs+s_to, ctx->buf+offs+s_from,
+               len + 1 - s_from - offs) ;
+       }
+       offs += s_to ;
+      }
+    } else {
+      s_from = strlen(m->from.c) ;
+      s_to = strlen(m->to) ;
+      for ( found = strstr(ctx->buf, m->from.c) ; found ;
+               found = strstr(ctx->buf+match+s_to, m->from.c) ) {
+       match = found - ctx->buf ;
+       if ( ( m->flags & M_ATSTART ) && ( match != 0) )
+         break ;
+       len = strlen(ctx->buf) ;
+       if ( ( m->flags & M_ATEND ) && ( match < (len - s_from) ) )
+         continue ;
+       VERBOSE( ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, ctx->f->r,
+           "C: matched %s, substituting %s", m->from.c, m->to) ) ;
+       if ( s_to > s_from ) {
+         preserve(ctx, s_to - s_from) ;
+         memmove(ctx->buf+match+s_to, ctx->buf+match+s_from,
+               len + 1 - s_from - match) ;
+         memcpy(ctx->buf+match, m->to, s_to) ;
+       } else {
+         memcpy(ctx->buf+match, m->to, s_to) ;
+         memmove(ctx->buf+match+s_to, ctx->buf+match+s_from,
+               len + 1 - s_from - match) ;
+       }
+      }
+    }
+  }
+  ap_fputs(ctx->f->next, ctx->bb, ctx->buf) ;
+}
+static void pcdata(void* ctxt, const xmlChar *chars, int length) {
+  saxctxt* ctx = (saxctxt*) ctxt ;
+  if ( ctx->cfg->extfix ) {
+    pappend(ctx, chars, length) ;
+  } else {
+    ap_fwrite(ctx->f->next, ctx->bb, chars, length) ;
+  }
+}
+static void pcomment(void* ctxt, const xmlChar *chars) {
+  saxctxt* ctx = (saxctxt*) ctxt ;
+  if ( ctx->cfg->strip_comments )
+    return ;
+
+  if ( ctx->cfg->extfix ) {
+    pappend(ctx, "<!--", 4) ;
+    pappend(ctx, chars, strlen(chars) ) ;
+    pappend(ctx, "-->", 3) ;
+  } else {
+    ap_fputstrs(ctx->f->next, ctx->bb, "<!--", chars, "-->", NULL) ;
+  }
+}
+static void pendElement(void* ctxt, const xmlChar* name) {
+  saxctxt* ctx = (saxctxt*) ctxt ;
+  if ( ctx->offset > 0 ) {
+    dump_content(ctx) ;
+    ctx->offset = 0 ;  /* having dumped it, we can re-use the memory */
+  }
+  if ( ! is_empty_elt(name) )
+    ap_fprintf(ctx->f->next, ctx->bb, "</%s>", name) ;
+}
+static void pstartElement(void* ctxt, const xmlChar* name,
+               const xmlChar** attrs ) {
+
+  int num_match ;
+  size_t offs, len ;
+  char* subs ;
+  rewrite_t is_uri ;
+  const char** linkattrs ;
+  const xmlChar** a ;
+  const elt_t* elt ;
+  const char** linkattr ;
+  urlmap* m ;
+  size_t s_to, s_from, match ;
+  char* found ;
+  saxctxt* ctx = (saxctxt*) ctxt ;
+  size_t nmatch ;
+  regmatch_t pmatch[10] ;
+#ifndef GO_FASTER
+  int verbose = ctx->cfg->verbose ;
+#endif
+
+  static const char* href[] = { "href", NULL } ;
+  static const char* cite[] = { "cite", NULL } ;
+  static const char* action[] = { "action", NULL } ;
+  static const char* imgattr[] = { "src", "longdesc", "usemap", NULL } ;
+  static const char* inputattr[] = { "src", "usemap", NULL } ;
+  static const char* scriptattr[] = { "src", "for", NULL } ;
+  static const char* frameattr[] = { "src", "longdesc", NULL } ;
+  static const char* objattr[] = { "classid", "codebase", "data", "usemap", NULL } ;
+  static const char* profile[] = { "profile", NULL } ;
+  static const char* background[] = { "background", NULL } ;
+  static const char* codebase[] = { "codebase", NULL } ;
+
+  static const elt_t linked_elts[] = {
+    { "a" , href } ,
+    { "img" , imgattr } ,
+    { "form", action } ,
+    { "link" , href } ,
+    { "script" , scriptattr } ,
+    { "base" , href } ,
+    { "area" , href } ,
+    { "input" , inputattr } ,
+    { "frame", frameattr } ,
+    { "iframe", frameattr } ,
+    { "object", objattr } ,
+    { "q" , cite } ,
+    { "blockquote" , cite } ,
+    { "ins" , cite } ,
+    { "del" , cite } ,
+    { "head" , profile } ,
+    { "body" , background } ,
+    { "applet", codebase } ,
+    { NULL, NULL }
+  } ;
+  static const char* events[] = {
+       "onclick" ,
+       "ondblclick" ,
+       "onmousedown" ,
+       "onmouseup" ,
+       "onmouseover" ,
+       "onmousemove" ,
+       "onmouseout" ,
+       "onkeypress" ,
+       "onkeydown" ,
+       "onkeyup" ,
+       "onfocus" ,
+       "onblur" ,
+       "onload" ,
+       "onunload" ,
+       "onsubmit" ,
+       "onreset" ,
+       "onselect" ,
+       "onchange" ,
+       NULL
+  } ;
+
+  ap_fputc(ctx->f->next, ctx->bb, '<') ;
+  ap_fputs(ctx->f->next, ctx->bb, name) ;
+
+  if ( attrs ) {
+    linkattrs = 0 ;
+    for ( elt = linked_elts;  elt->name != NULL ; ++elt )
+      if ( !strcmp(elt->name, name) ) {
+       linkattrs = elt->attrs ;
+       break ;
+      }
+    for ( a = attrs ; *a ; a += 2 ) {
+      ctx->offset = 0 ;
+      if ( a[1] ) {
+       pappend(ctx, a[1], strlen(a[1])+1) ;
+       is_uri = ATTR_IGNORE ;
+       if ( linkattrs ) {
+         for ( linkattr = linkattrs ; *linkattr ; ++linkattr) {
+           if ( !strcmp(*linkattr, *a) ) {
+             is_uri = ATTR_URI ;
+             break ;
+           }
+         }
+       }
+       if ( (is_uri == ATTR_IGNORE) && ctx->cfg->extfix ) {
+         for ( linkattr = events; *linkattr; ++linkattr ) {
+           if ( !strcmp(*linkattr, *a) ) {
+             is_uri = ATTR_EVENT ;
+             break ;
+           }
+         }
+       }
+       switch ( is_uri ) {
+         case ATTR_URI:
+           num_match = 0 ;
+           for ( m = ctx->cfg->map ; m ; m = m->next ) {
+             if ( ! ( m->flags & M_HTML ) )
+               continue ;
+             if ( m->flags & M_REGEX ) {
+               nmatch = 10 ;
+               if ( ! ap_regexec(m->from.r, ctx->buf, nmatch, pmatch, 0) ) {
+                 ++num_match ;
+                 offs = match = pmatch[0].rm_so ;
+                 s_from = pmatch[0].rm_eo - match ;
+                 subs = ap_pregsub(ctx->f->r->pool, m->to, ctx->buf,
+                       nmatch, pmatch) ;
+                 VERBOSE( {
+                   const char* f = apr_pstrndup(ctx->f->r->pool,
+                       ctx->buf + offs , s_from ) ;
+                   ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, ctx->f->r,
+                       "H/RX: match at %s, substituting %s", f, subs) ;
+                 } )
+                 s_to = strlen(subs) ;
+                 len = strlen(ctx->buf) ;
+                 if ( s_to > s_from) {
+                   preserve(ctx, s_to - s_from) ;
+                   memmove(ctx->buf+offs+s_to, ctx->buf+offs+s_from,
+                       len + 1 - s_from - offs) ;
+                   memcpy(ctx->buf+offs, subs, s_to) ;
+                 } else {
+                   memcpy(ctx->buf + offs, subs, s_to) ;
+                   memmove(ctx->buf+offs+s_to, ctx->buf+offs+s_from,
+                       len + 1 - s_from - offs) ;
+                 }
+               }
+             } else {
+               s_from = strlen(m->from.c) ;
+               if ( ! strncasecmp(ctx->buf, m->from.c, s_from ) ) {
+                 ++num_match ;
+                 s_to = strlen(m->to) ;
+                 len = strlen(ctx->buf) ;
+                 VERBOSE( ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, ctx->f->r,
+                   "H: matched %s, substituting %s", m->from.c, m->to) ) ;
+                 if ( s_to > s_from ) {
+                   preserve(ctx, s_to - s_from) ;
+                   memmove(ctx->buf+s_to, ctx->buf+s_from,
+                       len + 1 - s_from ) ;
+                   memcpy(ctx->buf, m->to, s_to) ;
+                 } else {      /* it fits in the existing space */
+                   memcpy(ctx->buf, m->to, s_to) ;
+                   memmove(ctx->buf+s_to, ctx->buf+s_from,
+                       len + 1 - s_from) ;
+                 }
+                 break ;
+               }
+             }
+             if ( num_match > 0 )      /* URIs only want one match */
+               break ;
+           }
+           break ;
+         case ATTR_EVENT:
+           for ( m = ctx->cfg->map ; m ; m = m->next ) {
+             num_match = 0 ;   /* reset here since we're working per-rule */
+             if ( ! ( m->flags & M_EVENTS ) )
+               continue ;
+             if ( m->flags & M_REGEX ) {
+               nmatch = 10 ;
+               offs = 0 ;
+               while ( ! ap_regexec(m->from.r, ctx->buf+offs,
+                       nmatch, pmatch, 0) ) {
+                 match = pmatch[0].rm_so ;
+                 s_from = pmatch[0].rm_eo - match ;
+                 subs = ap_pregsub(ctx->f->r->pool, m->to, ctx->buf+offs,
+                       nmatch, pmatch) ;
+                 VERBOSE( {
+                   const char* f = apr_pstrndup(ctx->f->r->pool,
+                       ctx->buf + offs , s_from ) ;
+                   ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, ctx->f->r,
+                       "E/RX: match at %s, substituting %s", f, subs) ;
+                 } )
+                 s_to = strlen(subs) ;
+                 offs += match ;
+                 len = strlen(ctx->buf) ;
+                 if ( s_to > s_from) {
+                   preserve(ctx, s_to - s_from) ;
+                   memmove(ctx->buf+offs+s_to, ctx->buf+offs+s_from,
+                       len + 1 - s_from - offs) ;
+                   memcpy(ctx->buf+offs, subs, s_to) ;
+                 } else {
+                   memcpy(ctx->buf + offs, subs, s_to) ;
+                   memmove(ctx->buf+offs+s_to, ctx->buf+offs+s_from,
+                       len + 1 - s_from - offs) ;
+                 }
+                 offs += s_to ;
+                 ++num_match ;
+               }
+             } else {
+               found = strstr(ctx->buf, m->from.c) ;
+               if ( (m->flags & M_ATSTART) && ( found != ctx->buf) )
+                 continue ;
+               while ( found ) {
+                 s_from = strlen(m->from.c) ;
+                 s_to = strlen(m->to) ;
+                 match = found - ctx->buf ;
+                 if ( ( s_from < strlen(found) ) && (m->flags & M_ATEND ) ) {
+                   found = strstr(ctx->buf+match+s_from, m->from.c) ;
+                   continue ;
+                 } else {
+                   found = strstr(ctx->buf+match+s_to, m->from.c) ;
+                 }
+                 VERBOSE( ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, ctx->f->r,
+                   "E: matched %s, substituting %s", m->from.c, m->to) ) ;
+                 len = strlen(ctx->buf) ;
+                 if ( s_to > s_from ) {
+                   preserve(ctx, s_to - s_from) ;
+                   memmove(ctx->buf+match+s_to, ctx->buf+match+s_from,
+                       len + 1 - s_from - match) ;
+                   memcpy(ctx->buf+match, m->to, s_to) ;
+                 } else {
+                   memcpy(ctx->buf+match, m->to, s_to) ;
+                   memmove(ctx->buf+match+s_to, ctx->buf+match+s_from,
+                       len + 1 - s_from - match) ;
+                 }
+                 ++num_match ;
+               }
+             }
+             if ( num_match && ( m->flags & M_LAST ) )
+               break ;
+           }
+           break ;
+         case ATTR_IGNORE:
+           break ;
+       }
+      }
+      if ( ! a[1] )
+       ap_fputstrs(ctx->f->next, ctx->bb, " ", a[0], NULL) ;
+      else {
+
+       if ( ctx->cfg->flags != 0 )
+         normalise(ctx->cfg->flags, ctx->buf) ;
+
+       /* write the attribute, using pcharacters to html-escape
+          anything that needs it in the value.
+       */
+       ap_fputstrs(ctx->f->next, ctx->bb, " ", a[0], "=\"", NULL) ;
+       pcharacters(ctx, ctx->buf, strlen(ctx->buf)) ;
+       ap_fputc(ctx->f->next, ctx->bb, '"') ;
+      }
+    }
+  }
+  ctx->offset = 0 ;
+  if ( is_empty_elt(name) )
+    ap_fputs(ctx->f->next, ctx->bb, ctx->cfg->etag) ;
+  else
+    ap_fputc(ctx->f->next, ctx->bb, '>') ;
+}
+static htmlSAXHandlerPtr setupSAX(apr_pool_t* pool) {
+  htmlSAXHandlerPtr sax = apr_pcalloc(pool, sizeof(htmlSAXHandler) ) ;
+  sax->startDocument = NULL ;
+  sax->endDocument = NULL ;
+  sax->startElement = pstartElement ;
+  sax->endElement = pendElement ;
+  sax->characters = pcharacters ;
+  sax->comment = pcomment ;
+  sax->cdataBlock = pcdata ;
+  return sax ;
+}
+
+static regex_t* seek_meta_ctype ;
+static regex_t* seek_charset ;
+static regex_t* seek_meta ;
+
+static void proxy_html_child_init(apr_pool_t* pool, server_rec* s) {
+  seek_meta_ctype = ap_pregcomp(pool,
+       "(<meta[^>]*http-equiv[ \t\r\n='\"]*content-type[^>]*>)",
+       REG_EXTENDED|REG_ICASE) ;
+  seek_charset = ap_pregcomp(pool, "charset=([A-Za-z0-9_-]+)",
+       REG_EXTENDED|REG_ICASE) ;
+  seek_meta = ap_pregcomp(pool, "<meta[^>]*(http-equiv)[^>]*>",
+       REG_EXTENDED|REG_ICASE) ;
+}
+
+static xmlCharEncoding sniff_encoding(request_rec* r, const char* cbuf, size_t bytes
+#ifndef GO_FASTER
+                       , int verbose
+#endif
+       ) {
+  xmlCharEncoding ret ;
+  char* encoding = NULL ;
+  char* p ;
+  char* q ;
+  regmatch_t match[2] ;
+  unsigned char* buf = (unsigned char*)cbuf ;
+
+  VERBOSE( ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r,
+               "Content-Type is %s", r->content_type) ) ;
+
+/* If we've got it in the HTTP headers, there's nothing to do */
+  if ( r->content_type &&
+       ( p = ap_strcasestr(r->content_type, "charset=") , p != NULL ) ) {
+    p += 8 ;
+    if ( encoding = apr_pstrndup(r->pool, p, strcspn(p, " ;") ) , encoding ) {
+      if ( ret = xmlParseCharEncoding(encoding),
+               ret != XML_CHAR_ENCODING_ERROR ) {
+       VERBOSE( ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r,
+               "Got charset %s from HTTP headers", encoding) ) ;
+       return ret ;
+      } else {
+       ap_log_rerror(APLOG_MARK, APLOG_ERR, 0, r,
+               "Unsupported charset %s in HTTP headers", encoding) ;
+       encoding = NULL ;
+      }
+    }
+  }
+
+/* to sniff, first we look for BOM */
+  if ( ret = xmlDetectCharEncoding(buf, bytes),
+       ret != XML_CHAR_ENCODING_NONE ) {
+    VERBOSE( ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r,
+       "Got charset from XML rules.") ) ;
+    return ret ;
+  }
+
+/* If none of the above, look for a META-thingey */
+  encoding = NULL ;
+  if ( ap_regexec(seek_meta_ctype, cbuf, 1, match, 0) == 0 ) {
+    p = apr_pstrndup(r->pool, cbuf + match[0].rm_so,
+       match[0].rm_eo - match[0].rm_so) ;
+    if ( ap_regexec(seek_charset, p, 2, match, 0) == 0 )
+      encoding = apr_pstrndup(r->pool, p+match[1].rm_so,
+       match[1].rm_eo - match[1].rm_so) ;
+  }
+
+/* either it's set to something we found or it's still the default */
+  if ( encoding )
+    if ( ret = xmlParseCharEncoding(encoding),
+       ret != XML_CHAR_ENCODING_ERROR ) {
+      VERBOSE( ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r,
+       "Got charset %s from HTML META", encoding) ) ;
+      return ret ;
+    } else {
+      ap_log_rerror(APLOG_MARK, APLOG_ERR, 0, r,
+       "Unsupported charset %s in HTML META", encoding) ;
+    }
+
+/* the old HTTP default is a last resort */
+  ap_log_rerror(APLOG_MARK, APLOG_WARNING, 0, r,
+       "No usable charset information: using old HTTP default LATIN1") ;
+  return XML_CHAR_ENCODING_8859_1 ;
+}
+static meta* metafix(request_rec* r, const char* buf /*, size_t bytes*/
+#ifndef GO_FASTER
+               , int verbose
+#endif
+       ) {
+  meta* ret = NULL ;
+  size_t offs = 0 ;
+  const char* p ;
+  const char* q ;
+  char* header ;
+  char* content ;
+  regmatch_t pmatch[2] ;
+  char delim ;
+
+  while ( ! ap_regexec(seek_meta, buf+offs, 2, pmatch, 0) ) {
+    header = NULL ;
+    content = NULL ;
+    p = buf+offs+pmatch[1].rm_eo ;
+    while ( *++p && !isalpha(*p) ) ;   /* stop at end of buffer */
+    for ( q = p ; isalnum(*q) || (*q == '-') ; ++q ) ;
+    header = apr_pstrndup(r->pool, p, q-p) ;
+    if ( strncasecmp(header, "Content-", 8) ) {
+/* find content=... string */
+      for ( p = strstr(buf+offs+pmatch[0].rm_so, "content") ; p ;
+               p = strstr(p, "content") ) {
+       p += 7 ;
+       while ( *p && isspace(*p) )
+         ++p ;
+       if ( *p != '=' )
+         continue ;
+       while ( *p && isspace(*++p) ) ;
+       if ( ( *p == '\'' ) || ( *p == '"' ) ) {
+         delim = *p++ ;
+         for ( q = p ; *q && ( *q != delim ) ; ++q ) ;
+       } else {
+         for ( q = p ; *q && !isspace(*q) && (*q != '>') ; ++q ) ;
+       }
+       content = apr_pstrndup(r->pool, p, q-p) ;
+       break ;
+      }
+    } else if ( !strncasecmp(header, "Content-Type", 12) ) {
+      ret = apr_palloc(r->pool, sizeof(meta) ) ;
+      ret->start = pmatch[0].rm_so ;
+      ret->end = pmatch[0].rm_eo ;
+    }
+    if ( header && content ) {
+      VERBOSE( ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r,
+       "Adding header [%s: %s] from HTML META", header, content) ) ; 
+      apr_table_setn(r->headers_out, header, content) ;
+    }
+    offs += pmatch[0].rm_eo ;
+  }
+  return ret ;
+}
+
+static int proxy_html_filter_init(ap_filter_t* f) {
+  const char* env ;
+  saxctxt* fctx ;
+
+#if 0
+/* remove content-length filter */
+  ap_filter_rec_t* clf = ap_get_output_filter_handle("CONTENT_LENGTH") ;
+  ap_filter_t* ff = f->next ;
+
+  do {
+    ap_filter_t* fnext = ff->next ;
+    if ( ff->frec == clf )
+      ap_remove_output_filter(ff) ;
+    ff = fnext ;
+  } while ( ff ) ;
+#endif
+
+  fctx = f->ctx = apr_pcalloc(f->r->pool, sizeof(saxctxt)) ;
+  fctx->sax = setupSAX(f->r->pool) ;
+  fctx->f = f ;
+  fctx->bb = apr_brigade_create(f->r->pool, f->r->connection->bucket_alloc) ;
+  fctx->cfg = ap_get_module_config(f->r->per_dir_config,&proxy_html_module);
+
+  if ( f->r->proto_num >= 1001 ) {
+    if ( ! f->r->main && ! f->r->prev ) {
+      env = apr_table_get(f->r->subprocess_env, "force-response-1.0") ;
+      if ( !env )
+       f->r->chunked = 1 ;
+    }
+  }
+
+  apr_table_unset(f->r->headers_out, "Content-Length") ;
+  apr_table_unset(f->r->headers_out, "ETag") ;
+  return OK ;
+}
+static saxctxt* check_filter_init (ap_filter_t* f) {
+
+  const char* errmsg = NULL ;
+  if ( ! f->r->proxyreq ) {
+    errmsg = "Non-proxy request; not inserting proxy-html filter" ;
+  } else if ( ! f->r->content_type ) {
+    errmsg = "No content-type; bailing out of proxy-html filter" ;
+  } else if ( strncasecmp(f->r->content_type, "text/html", 9) &&
+       strncasecmp(f->r->content_type, "application/xhtml+xml", 21) ) {
+    errmsg = "Non-HTML content; not inserting proxy-html filter" ;
+  }
+
+  if ( errmsg ) {
+#ifndef GO_FASTER
+    proxy_html_conf* cfg
+       = ap_get_module_config(f->r->per_dir_config, &proxy_html_module);
+    if ( cfg->verbose ) {
+      ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, f->r, "%s", errmsg) ;
+    }
+#endif
+    ap_remove_output_filter(f) ;
+    return NULL ;
+  }
+  if ( ! f->ctx )
+    proxy_html_filter_init(f) ;
+  return f->ctx ;
+}
+static int proxy_html_filter(ap_filter_t* f, apr_bucket_brigade* bb) {
+  apr_bucket* b ;
+  meta* m = NULL ;
+  xmlCharEncoding enc ;
+  const char* buf = 0 ;
+  apr_size_t bytes = 0 ;
+  int xmlopts = XML_PARSE_RECOVER | XML_PARSE_NONET |
+       XML_PARSE_NOBLANKS | XML_PARSE_NOERROR | XML_PARSE_NOWARNING ;
+
+  saxctxt* ctxt = check_filter_init(f) ;
+  if ( ! ctxt )
+    return ap_pass_brigade(f->next, bb) ;
+
+  for ( b = APR_BRIGADE_FIRST(bb) ;
+       b != APR_BRIGADE_SENTINEL(bb) ;
+       b = APR_BUCKET_NEXT(b) ) {
+    if ( APR_BUCKET_IS_EOS(b) ) {
+      if ( ctxt->parser != NULL ) {
+       htmlParseChunk(ctxt->parser, buf, 0, 1) ;
+      }
+      APR_BRIGADE_INSERT_TAIL(ctxt->bb,
+       apr_bucket_eos_create(ctxt->bb->bucket_alloc) ) ;
+      ap_pass_brigade(ctxt->f->next, ctxt->bb) ;
+    } else if ( apr_bucket_read(b, &buf, &bytes, APR_BLOCK_READ)
+             == APR_SUCCESS ) {
+      if ( ctxt->parser == NULL ) {
+       {
+         /* always copy into a NUL-terminated string for the parse
+            routines; buf[bytes] lies outside the bucket data, so it
+            must not be read to test for an existing terminator */
+         char* buf1 = apr_palloc(f->r->pool, bytes+1) ;
+         memcpy(buf1, buf, bytes) ;
+         buf1[bytes] = 0 ;
+         buf = buf1 ;
+       }
+#ifndef GO_FASTER
+       enc = sniff_encoding(f->r, buf, bytes, ctxt->cfg->verbose) ;
+       if ( ctxt->cfg->metafix )
+         m = metafix(f->r, buf, ctxt->cfg->verbose) ;
+#else
+       enc = sniff_encoding(f->r, buf, bytes) ;
+       if ( ctxt->cfg->metafix )
+         m = metafix(f->r, buf) ;
+#endif
+       ap_set_content_type(f->r, "text/html;charset=utf-8") ;
+       ap_fputs(f->next, ctxt->bb, ctxt->cfg->doctype) ;
+       if ( m ) {
+         ctxt->parser = htmlCreatePushParserCtxt(ctxt->sax, ctxt,
+               buf, m->start, 0, enc ) ;
+         htmlParseChunk(ctxt->parser, buf+m->end, bytes-m->end, 0) ;
+       } else {
+         ctxt->parser = htmlCreatePushParserCtxt(ctxt->sax, ctxt,
+               buf, bytes, 0, enc ) ;
+       }
+       apr_pool_cleanup_register(f->r->pool, ctxt->parser,
+               (void*)htmlFreeParserCtxt, apr_pool_cleanup_null) ;
+       if ( xmlopts = xmlCtxtUseOptions(ctxt->parser, xmlopts ), xmlopts )
+         ap_log_rerror(APLOG_MARK, APLOG_WARNING, 0, f->r,
+               "Unsupported parser opts %x", xmlopts) ;
+      } else {
+       htmlParseChunk(ctxt->parser, buf, bytes, 0) ;
+      }
+    } else {
+      ap_log_rerror(APLOG_MARK, APLOG_ERR, 0, f->r, "Error in bucket read") ;
+    }
+  }
+  /*ap_fflush(ctxt->f->next, ctxt->bb) ;       // uncomment for debug */
+  apr_brigade_cleanup(bb) ;
+  return APR_SUCCESS ;
+}
+static const char* fpi_html =
+       "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01//EN\">\n" ;
+static const char* fpi_html_legacy =
+       "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\">\n" ;
+static const char* fpi_xhtml =
+       "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n" ;
+static const char* fpi_xhtml_legacy =
+       "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n" ;
+static const char* html_etag = ">" ;
+static const char* xhtml_etag = " />" ;
+/*#define DEFAULT_DOCTYPE fpi_html */
+static const char* DEFAULT_DOCTYPE = "" ;
+#define DEFAULT_ETAG html_etag
+
+static void* proxy_html_config(apr_pool_t* pool, char* x) {
+  proxy_html_conf* ret = apr_pcalloc(pool, sizeof(proxy_html_conf) ) ;
+  ret->doctype = DEFAULT_DOCTYPE ;
+  ret->etag = DEFAULT_ETAG ;
+  ret->bufsz = 8192 ;
+  return ret ;
+}
+static void* proxy_html_merge(apr_pool_t* pool, void* BASE, void* ADD) {
+  proxy_html_conf* base = (proxy_html_conf*) BASE ;
+  proxy_html_conf* add = (proxy_html_conf*) ADD ;
+  proxy_html_conf* conf = apr_palloc(pool, sizeof(proxy_html_conf)) ;
+
+  if ( add->map && base->map ) {
+    urlmap* a ;
+    conf->map = NULL ;
+    for ( a = base->map ; a ; a = a->next ) {
+      urlmap* save = conf->map ;
+      conf->map = apr_pmemdup(pool, a, sizeof(urlmap)) ;
+      conf->map->next = save ;
+    }
+    for ( a = add->map ; a ; a = a->next ) {
+      urlmap* save = conf->map ;
+      conf->map = apr_pmemdup(pool, a, sizeof(urlmap)) ;
+      conf->map->next = save ;
+    }
+  } else
+    conf->map = add->map ? add->map : base->map ;
+
+  conf->doctype = ( add->doctype == DEFAULT_DOCTYPE )
+               ? base->doctype : add->doctype ;
+  conf->etag = ( add->etag == DEFAULT_ETAG ) ? base->etag : add->etag ;
+  conf->bufsz = add->bufsz ;
+  if ( add->flags & NORM_RESET ) {
+    conf->flags = add->flags ^ NORM_RESET ;
+    conf->metafix = add->metafix ;
+    conf->extfix = add->extfix ;
+    conf->strip_comments = add->strip_comments ;
+#ifndef GO_FASTER
+    conf->verbose = add->verbose ;
+#endif
+  } else {
+    conf->flags = base->flags | add->flags ;
+    conf->metafix = base->metafix | add->metafix ;
+    conf->extfix = base->extfix | add->extfix ;
+    conf->strip_comments = base->strip_comments | add->strip_comments ;
+#ifndef GO_FASTER
+    conf->verbose = base->verbose | add->verbose ;
+#endif
+  }
+  return conf ;
+}
+#define REGFLAG(n,s,c) ( (s&&(strchr((s),(c))!=NULL)) ? (n) : 0 )
+#define XREGFLAG(n,s,c) ( (!s||(strchr((s),(c))==NULL)) ? (n) : 0 )
+static const char* set_urlmap(cmd_parms* cmd, void* CFG,
+       const char* from, const char* to, const char* flags) {
+  int regflags ;
+  proxy_html_conf* cfg = (proxy_html_conf*)CFG ;
+  urlmap* map ;
+  urlmap* newmap = apr_palloc(cmd->pool, sizeof(urlmap) ) ;
+
+  newmap->next = NULL ;
+  newmap->flags
+       = XREGFLAG(M_HTML,flags,'h')
+       | XREGFLAG(M_EVENTS,flags,'e')
+       | XREGFLAG(M_CDATA,flags,'c')
+       | REGFLAG(M_ATSTART,flags,'^')
+       | REGFLAG(M_ATEND,flags,'$')
+       | REGFLAG(M_REGEX,flags,'R')
+       | REGFLAG(M_LAST,flags,'L')
+  ;
+
+  if ( cfg->map ) {
+    for ( map = cfg->map ; map->next ; map = map->next ) ;
+    map->next = newmap ;
+  } else
+    cfg->map = newmap ;
+
+  if ( ! (newmap->flags & M_REGEX) ) {
+    newmap->from.c = apr_pstrdup(cmd->pool, from) ;
+    newmap->to = apr_pstrdup(cmd->pool, to) ;
+  } else {
+    regflags
+       = REGFLAG(REG_EXTENDED,flags,'x')
+       | REGFLAG(REG_ICASE,flags,'i')
+       | REGFLAG(REG_NOSUB,flags,'n')
+       | REGFLAG(REG_NEWLINE,flags,'s')
+    ;
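+    newmap->from.r = ap_pregcomp(cmd->pool, from, regflags) ;
+    if ( newmap->from.r == NULL )
+      return apr_pstrcat(cmd->pool,
+       "ProxyHTMLURLMap: cannot compile regex ", from, NULL) ;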
+    newmap->to = apr_pstrdup(cmd->pool, to) ;
+  }
+  return NULL ;
+}
+static const char* set_doctype(cmd_parms* cmd, void* CFG, const char* t,
+       const char* l) {
+  proxy_html_conf* cfg = (proxy_html_conf*)CFG ;
+  if ( !strcasecmp(t, "xhtml") ) {
+    cfg->etag = xhtml_etag ;
+    if ( l && !strcasecmp(l, "legacy") )
+      cfg->doctype = fpi_xhtml_legacy ;
+    else
+      cfg->doctype = fpi_xhtml ;
+  } else if ( !strcasecmp(t, "html") ) {
+    cfg->etag = html_etag ;
+    if ( l && !strcasecmp(l, "legacy") )
+      cfg->doctype = fpi_html_legacy ;
+    else
+      cfg->doctype = fpi_html ;
+  } else {
+    cfg->doctype = apr_pstrdup(cmd->pool, t) ;
+    if ( l && ( ( l[0] == 'x' ) || ( l[0] == 'X' ) ) )
+      cfg->etag = xhtml_etag ;
+    else
+      cfg->etag = html_etag ;
+  }
+  return NULL ;
+}
+static void set_param(proxy_html_conf* cfg, const char* arg) {
+  if ( arg && *arg ) {
+    if ( !strcmp(arg, "lowercase") )
+      cfg->flags |= NORM_LC ;
+    else if ( !strcmp(arg, "dospath") )
+      cfg->flags |= NORM_MSSLASH ;
+    else if ( !strcmp(arg, "reset") )
+      cfg->flags |= NORM_RESET ;
+  }
+}
+static const char* set_flags(cmd_parms* cmd, void* CFG, const char* arg1,
+       const char* arg2, const char* arg3) {
+  set_param( (proxy_html_conf*)CFG, arg1) ;
+  set_param( (proxy_html_conf*)CFG, arg2) ;
+  set_param( (proxy_html_conf*)CFG, arg3) ;
+  return NULL ;
+}
+static const command_rec proxy_html_cmds[] = {
+  AP_INIT_TAKE23("ProxyHTMLURLMap", set_urlmap, NULL,
+       RSRC_CONF|ACCESS_CONF, "Map URL From To" ) ,
+  AP_INIT_TAKE12("ProxyHTMLDoctype", set_doctype, NULL,
+       RSRC_CONF|ACCESS_CONF, "(HTML|XHTML) [Legacy]" ) ,
+  AP_INIT_TAKE123("ProxyHTMLFixups", set_flags, NULL,
+       RSRC_CONF|ACCESS_CONF, "Options are lowercase, dospath, reset" ) ,
+  AP_INIT_FLAG("ProxyHTMLMeta", ap_set_flag_slot,
+       (void*)APR_OFFSETOF(proxy_html_conf, metafix),
+       RSRC_CONF|ACCESS_CONF, "Fix META http-equiv elements" ) ,
+  AP_INIT_FLAG("ProxyHTMLExtended", ap_set_flag_slot,
+       (void*)APR_OFFSETOF(proxy_html_conf, extfix),
+       RSRC_CONF|ACCESS_CONF, "Map URLs in Javascript and CSS" ) ,
+  AP_INIT_FLAG("ProxyHTMLStripComments", ap_set_flag_slot,
+       (void*)APR_OFFSETOF(proxy_html_conf, strip_comments),
+       RSRC_CONF|ACCESS_CONF, "Strip out comments" ) ,
+#ifndef GO_FASTER
+  AP_INIT_FLAG("ProxyHTMLLogVerbose", ap_set_flag_slot,
+       (void*)APR_OFFSETOF(proxy_html_conf, verbose),
+       RSRC_CONF|ACCESS_CONF, "Verbose Logging (use with LogLevel Info)" ) ,
+#endif
+  AP_INIT_TAKE1("ProxyHTMLBufSize", ap_set_int_slot,
+       (void*)APR_OFFSETOF(proxy_html_conf, bufsz),
+       RSRC_CONF|ACCESS_CONF, "Buffer size" ) ,
+  { NULL }
+} ;
+static int mod_proxy_html(apr_pool_t* p, apr_pool_t* p1, apr_pool_t* p2,
+       server_rec* s) {
+  ap_add_version_component(p, VERSION_STRING) ;
+  return OK ;
+}
+static void proxy_html_hooks(apr_pool_t* p) {
+  ap_register_output_filter("proxy-html", proxy_html_filter,
+       NULL, AP_FTYPE_RESOURCE) ;
+  ap_hook_post_config(mod_proxy_html, NULL, NULL, APR_HOOK_MIDDLE) ;
+  ap_hook_child_init(proxy_html_child_init, NULL, NULL, APR_HOOK_MIDDLE) ;
+}
+module AP_MODULE_DECLARE_DATA proxy_html_module = {
+       STANDARD20_MODULE_STUFF,
+       proxy_html_config,
+       proxy_html_merge,
+       NULL,
+       NULL,
+       proxy_html_cmds,
+       proxy_html_hooks
+} ;
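
The URL and event rewriting code in this patch repeats one pattern: grow the buffer with `preserve()` when the replacement is longer than the match, shift the tail with `memmove`, and copy the replacement in with `memcpy`. The sketch below shows that pattern in isolation. It assumes a `malloc`-allocated, NUL-terminated buffer; `splice` is a hypothetical helper that does not appear in the module, and `realloc` stands in for `preserve()`, whose definition is outside this excerpt.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of mod_proxy_html): replace the s_from
 * bytes at offset offs in the heap string *buf with the string subs.
 * Returns 0 on success, -1 if the buffer could not be grown. */
static int splice(char **buf, size_t offs, size_t s_from, const char *subs)
{
    size_t len = strlen(*buf);
    size_t s_to = strlen(subs);

    assert(offs + s_from <= len);

    if (s_to > s_from) {
        /* Grow first; realloc here plays the role of preserve(). */
        char *grown = realloc(*buf, len + (s_to - s_from) + 1);
        if (grown == NULL)
            return -1;
        *buf = grown;
    }

    /* Move the tail, including the terminating NUL, to its final
     * position. memmove is required because the regions may overlap. */
    memmove(*buf + offs + s_to, *buf + offs + s_from,
            len + 1 - s_from - offs);

    /* The gap is now exactly s_to bytes wide; fill it. */
    memcpy(*buf + offs, subs, s_to);
    return 0;
}
```

Moving the tail before writing the replacement works for both the growing and the shrinking case, which is why this sketch needs only one code path where the patch has two.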
config.html	[new file with mode: 0644]
guide.html	[new file with mode: 0644]
mod_proxy_html.c	[new file with mode: 0644]
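
In `set_urlmap`, the `h`, `e` and `c` flags are parsed with inverted logic: `XREGFLAG` sets a bit when the character is absent, so a rule with no flags applies to HTML links, scripting events and embedded script or style sections alike. The sketch below reproduces the two macros to show this. The bit values and the `parse_map_flags` wrapper are assumptions made for illustration, since the module's own `M_*` definitions are outside this excerpt.

```c
#include <string.h>

/* Bit values are assumed for this sketch; the module defines its own. */
enum {
    M_HTML    = 0x01,
    M_EVENTS  = 0x02,
    M_CDATA   = 0x04,
    M_REGEX   = 0x08,
    M_ATSTART = 0x10,
    M_ATEND   = 0x20,
    M_LAST    = 0x40
};

/* Set n when character c is present in the flag string s. */
#define REGFLAG(n,s,c)  ( ((s) && strchr((s),(c)) != NULL) ? (n) : 0 )
/* Set n when c is absent, or when no flag string was given at all. */
#define XREGFLAG(n,s,c) ( (!(s) || strchr((s),(c)) == NULL) ? (n) : 0 )

/* Hypothetical wrapper around the flag expression used in set_urlmap. */
static int parse_map_flags(const char *flags)
{
    return XREGFLAG(M_HTML, flags, 'h')
         | XREGFLAG(M_EVENTS, flags, 'e')
         | XREGFLAG(M_CDATA, flags, 'c')
         | REGFLAG(M_ATSTART, flags, '^')
         | REGFLAG(M_ATEND, flags, '$')
         | REGFLAG(M_REGEX, flags, 'R')
         | REGFLAG(M_LAST, flags, 'L');
}
```

With these assumed values, a flag string of `Rh` yields a regex rule that skips HTML links but still applies to events and script or style sections.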