mod_proxy_html: Configuration

mod_proxy_html Version 2.4 (Sept 2004) and upwards. Updates in Version 3 (Dec. 2006) are highlighted.

Configuration Directives

The following can be used anywhere in an httpd.conf or included configuration file.

ProxyHTMLURLMap

Syntax: ProxyHTMLURLMap from-pattern to-pattern [flags] [cond]

This is the key directive for rewriting HTML links. When parsing a document, whenever a link target matches from-pattern, the matching portion will be rewritten to to-pattern.
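
As a minimal sketch of the common case, the fragment below rewrites absolute links to a backend so that they point at the proxy's own path. The hostname backend.example.internal and the /app/ prefix are placeholders, not values from this documentation, and the surrounding ProxyPass and SetOutputFilter lines assume a conventional reverse-proxy setup using the proxy-html output filter.

```apache
# Hypothetical backend and public prefix; substitute your own.
ProxyPass        /app/ http://backend.example.internal/
ProxyPassReverse /app/ http://backend.example.internal/

<Location /app/>
    # Run responses through mod_proxy_html.
    SetOutputFilter proxy-html
    # Links starting with the backend URL are rewritten to start with /app/.
    ProxyHTMLURLMap http://backend.example.internal/ /app/
</Location>
```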

Starting at version 2.0, this supports a wider range of pattern-matching and substitutions, including regular expression search and replace, controlled by the optional third flags argument.

Starting at version 3.0, this also supports environment variable interpolation using the V and v flags, and rules may apply conditionally based on an environment variable. Note that interpolation takes place before the parse starts, so variables set during the parse (e.g. using SSI directives) will not apply. These per-request features are switched on by the ProxyHTMLInterp directive; leaving it Off avoids the extra processing.

Flags for ProxyHTMLURLMap

Flags are case-sensitive.

h

Ignore HTML links (pass through unchanged)

e

Ignore scripting events (pass through unchanged)

c

Pass embedded script and style sections through untouched.

L

Last-match. If this rule matches, no more rules are applied (note that this happens automatically for HTML links).

l

Opposite to L. Overrides the one-change-only default behaviour with HTML links.

R

Use Regular Expression matching-and-replace. from-pattern is a regexp, and to-pattern a replacement string that may be based on the regexp. Regexp memory is supported: you can use brackets () in the from-pattern and retrieve the matches with $1 to $9 in the to-pattern.

If R is not set, it will use string-literal search-and-replace, as in versions 1.x. Logic is starts-with in HTML links, but contains in scripting events and embedded script and style sections.
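
To illustrate the difference, here is a sketch of one literal rule and one regexp rule. The paths and the version-number scheme are invented for the example. The x flag is added on the assumption that extended syntax is wanted for the + operator; whether it is strictly needed may depend on the regexp library your httpd is built with.

```apache
# Literal match: in an HTML link, a URL starting with /old/ has that
# prefix replaced by /new/.
ProxyHTMLURLMap /old/ /new/

# Regexp match with memory: the digits captured by () are reused as $1,
# so /static/v12/logo.png should become /assets/12/logo.png.
ProxyHTMLURLMap ^/static/v([0-9]+)/ /assets/$1/ Rx
```

Flags are combined into a single argument, as in Rx here.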

x

Use POSIX extended Regular Expressions. Only applicable with R.

i

Case-insensitive matching. Only applicable with R.

n

Disable regexp memory (for speed). Only applicable with R.

s

Line-based regexp matching. Only applicable with R.

^

Match at start only. This applies only to string matching (not regexps) and is irrelevant to HTML links.

$

Match at end only. This applies only to string matching (not regexps) and is irrelevant to HTML links.

V

Interpolate environment variables in to-pattern. A string of the form ${varname|default} will be replaced by the value of environment variable varname. If that is unset, it is replaced by default. The |default is optional.

NOTE: interpolation will only be enabled if ProxyHTMLInterp is On.

v

Interpolate environment variables in from-pattern. Patterns supported are as above.

NOTE: interpolation will only be enabled if ProxyHTMLInterp is On.
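
The following sketch shows both interpolation flags. The variable names BACKEND_HOST and PUBLIC_PREFIX and their values are hypothetical, and SetEnv is only one possible way of setting them; anything that sets the variables before the parse starts should serve.

```apache
# Interpolation is ignored unless this is On.
ProxyHTMLInterp On

# Hypothetical variables for the example.
SetEnv BACKEND_HOST backend.example.internal
SetEnv PUBLIC_PREFIX /app/

# V: ${PUBLIC_PREFIX|/} in to-pattern is replaced by the variable's value,
# or by the default / if the variable is unset.
ProxyHTMLURLMap http://backend.example.internal/ ${PUBLIC_PREFIX|/} V

# v and V together: interpolate in from-pattern and to-pattern.
ProxyHTMLURLMap http://${BACKEND_HOST}/ ${PUBLIC_PREFIX|/} vV
```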

Conditions for ProxyHTMLURLMap

The optional cond argument specifies a condition to test before the parse. If a condition is unsatisfied, the URLMap will be ignored in this parse.

The condition takes the form [!]var[=val], and is satisfied if the value of environment variable var is val. If the optional =val is omitted, then any value of var satisfies the condition, provided only it is set to something. If the first character is !, the condition is reversed.

NOTE: conditions will only be applied if ProxyHTMLInterp is On.
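
A sketch of the two condition forms follows. INTERNAL_CLIENT and SITE_LANG are hypothetical variable names, and the address pattern is arbitrary. Because cond is the fourth argument, the examples supply a flags argument (L) so that the condition lands in the right position; that is a reading of the syntax line above rather than a documented requirement.

```apache
# Conditions are ignored unless this is On.
ProxyHTMLInterp On

# Hypothetical variable, set only for requests from a 10.x.x.x address.
SetEnvIf Remote_Addr "^10\." INTERNAL_CLIENT

# Negated form: the rule applies only when INTERNAL_CLIENT is not set.
ProxyHTMLURLMap http://backend.example.internal/ /app/ L !INTERNAL_CLIENT

# var=val form: the rule applies only when SITE_LANG has the value fr.
# SITE_LANG is assumed to be set elsewhere in the configuration.
ProxyHTMLURLMap /docs/ /fr/docs/ L SITE_LANG=fr
```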

ProxyHTMLInterp

Syntax: ProxyHTMLInterp On|Off

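Enables the per-request features of ProxyHTMLURLMap introduced in version 3: environment variable interpolation (the V and v flags) and conditional rules.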

ProxyHTMLDoctype

Syntax: ProxyHTMLDoctype HTML|XHTML [Legacy]

Alternative Syntax: ProxyHTMLDoctype fpi [SGML|XML]

In the first form, documents will be declared as HTML 4.01 or XHTML 1.0 according to the option selected. This option also determines whether HTML or XHTML syntax is used for output. Note that the format of the documents coming from the backend server is immaterial: the parser will deal with it automatically. If the optional second argument is set to "Legacy", documents will be declared "Transitional", an option that may be necessary if you are proxying pre-1998 content or working with defective authoring/publishing tools.

In the second form, it will insert your own FPI. The optional second argument determines whether SGML/HTML or XML/XHTML syntax will be used.

Starting at version 2.0, the default is changed to omitting any FPI, on the grounds that no FPI is better than a bogus one. If your backend generates decent HTML or XHTML, set it accordingly.

From version 3, if the first form is used, mod_proxy_html will also clean up the HTML to the specified standard. It cannot fix every error, but it will strip out bogus elements and attributes. It will also optionally log other errors at LogLevel Debug.
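
As an illustration of the first form only, the sketch below declares different output standards for two parts of a site. The /legacy/ and /app/ paths are placeholders.

```apache
# Older content: declare (and, from version 3, clean up to)
# HTML 4.01 Transitional.
<Location /legacy/>
    ProxyHTMLDoctype HTML Legacy
</Location>

# Elsewhere: declare XHTML 1.0 and emit XHTML syntax.
<Location /app/>
    ProxyHTMLDoctype XHTML
</Location>
```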

ProxyHTMLFixups

Syntax: ProxyHTMLFixups [lowercase] [dospath] [reset]

This directive takes one to three arguments as follows:

  • lowercase: URLs are rewritten to lowercase.
  • dospath: Backslashes in URLs are rewritten to forward slashes.
  • reset: Unset any options set at a higher level in the configuration.

Take care when using these. The fixes will correct certain authoring mistakes, but risk also erroneously fixing links that were correct to start with. Only use them if you know you have a broken backend server.
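
A short sketch of how the options might be combined and then cleared for a subtree; the paths are hypothetical.

```apache
# Backend known to emit mixed-case URLs with backslashes.
<Location /app/>
    ProxyHTMLFixups lowercase dospath
</Location>

# A well-behaved part of the same site: drop the inherited fixups.
<Location /app/clean/>
    ProxyHTMLFixups reset
</Location>
```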

ProxyHTMLMeta

Syntax: ProxyHTMLMeta [On|Off]

Parses <meta http-equiv ...> elements to real HTTP headers.

In version 3, this is also tied in with the improved internationalisation support, and is required to support some character encodings.

ProxyHTMLExtended

Syntax: ProxyHTMLExtended [On|Off]

Set to Off, this gives the same behaviour as 1.x versions of mod_proxy_html. HTML links are rewritten according to the ProxyHTMLURLMap directives, but links appearing in Javascript and CSS are ignored.

Set to On, all scripting events and embedded scripts or stylesheets are also processed by the ProxyHTMLURLMap rules, according to the flags set for each rule. Since this requires more parsing, performance will be best if you only enable it when strictly necessary.
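
The sketch below pairs ProxyHTMLExtended with a rule aimed only at embedded stylesheets and scripts. The hostname and /app prefix are placeholders, and the pattern assumes that url(...) references in CSS are what needs rewriting. The h and e flags exclude HTML links and scripting events, so the rule is left to act on embedded script and style sections.

```apache
<Location /app/>
    ProxyHTMLExtended On

    # Ordinary HTML links.
    ProxyHTMLURLMap http://backend.example.internal/ /app/

    # url(http://backend.example.internal/x.png) becomes url(/app/x.png).
    # R = regexp, i = case-insensitive, h = skip HTML links, e = skip events.
    ProxyHTMLURLMap url\(http://backend\.example\.internal([^\)]*)\) url(/app$1) Rihe
</Location>
```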

ProxyHTMLStripComments

Syntax: ProxyHTMLStripComments [On|Off]

This directive will cause mod_proxy_html to strip HTML comments. Note that this will also kill off any scripts or styles embedded in comments (a bogosity introduced in 1995/6 with Netscape 2 for the benefit of then-older browsers, but still in use today). It may also interfere with comment-based processors such as SSI or ESI: be sure to run any of those before mod_proxy_html in the filter chain if stripping comments!

ProxyHTMLLogVerbose

Syntax: ProxyHTMLLogVerbose [On|Off]

Turns on verbose logging. This causes mod_proxy_html to make error log entries (at LogLevel Info) about charset detection and about all meta substitutions and rewrites made. When Off, only errors and warnings (if any) are logged.

ProxyHTMLBufSize

Syntax: ProxyHTMLBufSize nnnn

Set the buffer size increment for buffering inline stylesheets and scripts.

In order to parse non-HTML content (stylesheets and scripts), mod_proxy_html has to read the entire script or stylesheet into a buffer. This buffer will be expanded as necessary to hold the largest script or stylesheet in a page, in increments of nnnn as set by this directive.

The default is 8192, and will work well for almost all pages. However, if you know you're proxying a lot of pages containing stylesheets and/or scripts bigger than 8K (that is, for a single script or stylesheet, NOT in total), it will be more efficient to set a larger buffer size and avoid the need to resize the buffer dynamically during a request.
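
For instance, if individual inline scripts on the proxied pages commonly run to around 30K, a setting along these lines might avoid repeated resizing. The figure 32768 is an arbitrary illustration, not a recommendation from this documentation.

```apache
# Grow the script/stylesheet buffer in 32K steps instead of the default 8K.
ProxyHTMLBufSize 32768
```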

ProxyHTMLEvents

Syntax: ProxyHTMLEvents attr [attr ...]

Specifies one or more attributes to treat as scripting events and apply URLMaps to where appropriate. You can specify any number of attributes in one or more ProxyHTMLEvents directives. The sample configuration defines the events in standard HTML 4 and XHTML 1.

ProxyHTMLLinks

Syntax: ProxyHTMLLinks elt attr [attr ...]

Specifies elements that have URL attributes that should be rewritten using standard URLMaps as in versions 1 and 2 of mod_proxy_html. You will need one ProxyHTMLLinks directive per element, but it can have any number of attributes. The sample configuration defines the HTML links for standard HTML 4 and XHTML 1.
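
The fragment below is an abbreviated sketch of what such definitions look like, using a handful of standard HTML 4 elements and event attributes. It is not a copy of the sample configuration, which covers more elements and attributes.

```apache
# One ProxyHTMLLinks line per element, listing its URL-valued attributes.
ProxyHTMLLinks a      href
ProxyHTMLLinks area   href
ProxyHTMLLinks link   href
ProxyHTMLLinks img    src longdesc usemap
ProxyHTMLLinks form   action
ProxyHTMLLinks script src

# Attributes to treat as scripting events; the list may be split over
# several ProxyHTMLEvents lines.
ProxyHTMLEvents onclick ondblclick onmouseover onmouseout
ProxyHTMLEvents onload onunload onsubmit onchange
```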

ProxyHTMLCharsetAlias

Syntax: ProxyHTMLCharsetAlias charset alias [alias ...]

This server-wide directive aliases one or more charsets to another charset. This enables encodings not recognised by libxml2 to be handled internally by libxml2's charset support using the translation table for a recognised charset.

For example, Latin 1 (ISO-8859-1) is supported by libxml2. Microsoft's Windows-1252 is almost identical and can be supported by aliasing it:

ProxyHTMLCharsetAlias ISO-8859-1 Windows-1252

ProxyHTMLCharsetDefault

Syntax: ProxyHTMLCharsetDefault name

This defines the default encoding to assume when absolutely no charset information is available from the backend server. The default value for this is ISO-8859-1, as specified in HTTP/1.0 and assumed in earlier mod_proxy_html versions.

ProxyHTMLCharsetOut

Syntax: ProxyHTMLCharsetOut name

This selects an encoding for mod_proxy_html output. It should not normally be used, as any change from the default UTF-8 (Unicode - as used internally by libxml2) will impose an additional processing overhead. The special token ProxyHTMLCharsetOut * will generate output using the same encoding as the input.
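
A sketch of the two charset directives used together, assuming a hypothetical backend under /app/ that sends UTF-8 without declaring it and clients that expect output in the encoding of the input:

```apache
<Location /app/>
    # Assumed for this example: undeclared backend content is really UTF-8.
    ProxyHTMLCharsetDefault UTF-8

    # Emit output in the same encoding as the input rather than always UTF-8.
    ProxyHTMLCharsetOut *
</Location>
```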

ProxyHTMLStartParse

Syntax: ProxyHTMLStartParse element [elt*]

Specify that the HTML parser should start at the first instance of any of the elements specified. This can be used where a broken backend inserts leading junk that messes up the parser.
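
As a sketch, for a hypothetical backend that emits stray text before the document proper, parsing could be told to begin at whichever of html or head appears first. The choice of elements is an assumption for the example.

```apache
# Skip anything the backend emits before the first <html> or <head>.
ProxyHTMLStartParse html head
```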

Other Configuration

Normally, mod_proxy_html will refuse to run when not in a proxy or when the contents are not HTML. This can be overridden (at your own risk) by setting the environment variable PROXY_HTML_FORCE (e.g. with the SetEnv directive).
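
A minimal sketch of such an override, for a hypothetical /local-html/ area that is served locally rather than proxied. The value 1 is arbitrary; the text above only asks for the variable to be set.

```apache
<Location /local-html/>
    # At your own risk: run the filter even though this is not proxied content.
    SetEnv PROXY_HTML_FORCE 1
    SetOutputFilter proxy-html
</Location>
```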