From 5e5032129599538ea115b63e68b24227a80d2491 Mon Sep 17 00:00:00 2001
From: Emmanuel Lacour
Date: Sat, 21 Nov 2009 01:42:20 +0100
Subject: [PATCH] New upstream release (3.1.2)

---
 README           |   17 +-
 config.html      |  101 ++++++------
 faq.html         |   61 ++++---
 guide.html       |   72 ++++-----
 mod_proxy_html.c |  483 ++++++++++++-------------------------------------------
 proxy_html.conf  |   31 +++-
 6 files changed, 263 insertions(+), 502 deletions(-)

diff --git a/README b/README
index 2c2f65d..086a344 100644
--- a/README
+++ b/README
@@ -3,11 +3,20 @@ DOCUMENTATION for this module is at
 UPGRADING: IMPORTANT NOTE
 
-If you are upgrading from mod_proxy_html 2.x (or 1.x), you will need
-some new configuration. You can Include proxy_html.conf from this
-bundle in your httpd.conf (or apache.conf) to use Version 3 as a
-drop-in replacement for Version 2.
+Upgrading from 3.0 or any earlier version, you may need to:
+  (1) Load mod_xml2enc
+  (2) Use the new "ProxyHTMLEnable On" directive in place of
+      Apache's general-purpose filter configuration (such as
+      SetOutputFilter or FilterChain).
+Without these it'll work fine with ASCII or Unicode utf-8,
+but is likely to display characters incorrectly with other
+character encodings.
 
+If you are upgrading from mod_proxy_html 2.x (or 1.x), you will need
+to configure what HTML elements should be treated as links and events.
+The configuration file "proxy_html.conf" loaded into your httpd.conf
+(or apache.conf, apache2.conf or similar according to packager's whims)
+does this if you're dealing with standard/W3C HTML 4 and/or XHTML 1.
 
 WINDOWS USERS:
 
diff --git a/config.html b/config.html
index f2fcacd..87d7f0d 100644
--- a/config.html
+++ b/config.html
@@ -8,12 +8,19 @@

mod_proxy_html: Configuration

-

mod_proxy_html Version 2.4 (Sept 2004) and upwards.
-Updates in Version 3 (Dec. 2006) are highlighted.

+

mod_proxy_html Version 3.1 (April 2009).

Configuration Directives

The following can be used anywhere in an httpd.conf or included configuration file.

+
ProxyHTMLEnable
+
+

Syntax: ProxyHTMLEnable On|Off

+

Enables mod_proxy_html filtering in a scope (<Location>
+or top level/virtualhost). This also configures mod_xml2enc if present, and replaces use of
+any generic filter configuration (e.g. SetOutputFilter
+or FilterProvider) to configure both these modules.
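As a rough illustration of the directive described above, a reverse-proxy scope might be configured along the following lines. This is a minimal sketch, not taken from the distribution: the /app/ path, the internal.example.com backend and the module file paths are assumptions, and mod_proxy with mod_proxy_http is assumed to be loaded already.

```apache
# Hypothetical module paths; adjust to your installation.
LoadModule proxy_html_module modules/mod_proxy_html.so
LoadModule xml2enc_module    modules/mod_xml2enc.so

# Hypothetical backend and public path.
<Location /app/>
    ProxyPass        http://internal.example.com/
    ProxyPassReverse http://internal.example.com/
    # Takes the place of generic filter configuration such as
    # "SetOutputFilter proxy-html" used with earlier versions.
    ProxyHTMLEnable  On
    ProxyHTMLURLMap  http://internal.example.com/ /app/
</Location>
```

Links are only rewritten if the link and event definitions (for example those in the sample proxy_html.conf) are also loaded.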

ProxyHTMLURLMap

Syntax:
@@ -25,7 +32,7 @@ portion will be rewritten to to-pattern.

and substitutions, including regular expression search and replace, controlled by the optional third flags argument.

-

Starting at version 3.0, this also supports environment variable
+Starting at version 3.0, this also supports environment variable
interpolation using the V and v flags, and rules may apply
conditionally based on an environment variable. Note that interpolation
takes place before the parse starts, so variables set during the parse (e.g.
@@ -44,9 +51,9 @@ be disabled for speed.

L

Last-match. If this rule matches, no more rules are applied (note that this happens automatically for HTML links).

-
l
-
Opposite to L. Overrides the one-change-only default
-behaviour with HTML links.
+
l
+

Opposite to L. Overrides the one-change-only default
+behaviour with HTML links.

R

Use Regular Expression matching-and-replace. from-pattern is a regexp, and to-pattern a replacement string that may be
@@ -72,33 +79,33 @@ versions 1.x. Logic is starts-with in HTML links, but

$

Match at end only. This applies only to string matching (not regexps) and is irrelevant to HTML links.

-
V
-

Interpolate environment variables in to-pattern.
+

V
+

Interpolate environment variables in to-pattern. A string of the form ${varname|default} will be replaced by the value of environment variable varname. If that is unset, it is replaced by default. The |default is optional.

NOTE: interpolation will only be enabled if ProxyHTMLInterp is On.

-
v
-

Interpolate environment variables in from-pattern.
+

v
+

Interpolate environment variables in from-pattern. Patterns supported are as above.

NOTE: interpolation will only be enabled if ProxyHTMLInterp is On.
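The V and v flags described above might be used as in the following sketch. The variable names public_host and backend_host, and all hostnames, are hypothetical; the variables are assumed to have been set before the parse starts (for example by SetEnvIf), since interpolation takes place before the parse.

```apache
# Interpolation is only performed when this is On.
ProxyHTMLInterp On

# V flag: interpolate in the to-pattern. Uses the value of the
# (hypothetical) variable public_host, or www.example.com if unset.
ProxyHTMLURLMap http://internal.example.com/ http://${public_host|www.example.com}/ V

# v flag: interpolate in the from-pattern. Uses the value of the
# (hypothetical) variable backend_host, or internal.example.com if unset.
ProxyHTMLURLMap http://${backend_host|internal.example.com}/ /app/ v
```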

-

Conditions for ProxyHTMLURLMap

-

The optional cond argument specifies a condition to
+

Conditions for ProxyHTMLURLMap

+

The optional cond argument specifies a condition to test before the parse. If a condition is unsatisfied, the URLMap will be ignored in this parse.

-

The condition takes the form [!]var[=val], and is
+

The condition takes the form [!]var[=val], and is satisfied if the value of environment variable var is val. If the optional =val is omitted, then any value of var satisfies the condition, provided only it is set to something. If the first character is !, the condition is reversed.

-

NOTE: conditions will only be applied if ProxyHTMLInterp is On.

+

NOTE: conditions will only be applied if ProxyHTMLInterp is On.
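The three shapes of condition described above could look like this. It is a sketch only: the variable names rewrite_links, site and no_rewrite and the URLs are hypothetical, and it assumes the condition is given as the fourth argument, after the flags.

```apache
# Conditions are only evaluated when this is On.
ProxyHTMLInterp On

# var: applies if rewrite_links is set to any value.
ProxyHTMLURLMap http://internal.example.com/ /app/ L rewrite_links

# var=val: applies only if site has the value "beta".
ProxyHTMLURLMap http://internal.example.com/ /beta/ L site=beta

# !var: applies only if no_rewrite is NOT set.
ProxyHTMLURLMap http://other.example.com/ /other/ L !no_rewrite
```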

-
ProxyHTMLInterp
-
+
ProxyHTMLInterp
+

Syntax: ProxyHTMLInterp On|Off

Enables new (per-request) features of ProxyHTMLURLMap.

@@ -120,7 +127,7 @@ argument determines whether SGML/HTML or XML/XHTML syntax will be used.

Starting at version 2.0, the default is changed to omitting any FPI, on the grounds that no FPI is better than a bogus one. If your backend generates decent HTML or XHTML, set it accordingly.

-

From version 3, if the first form is used, mod_proxy_html
+

From version 3, if the first form is used, mod_proxy_html will also clean up the HTML to the specified standard. It cannot fix every error, but it will strip out bogus elements and attributes. It will also optionally log other errors at LogLevel Debug.

@@ -142,7 +149,7 @@ Only use them if you know you have a broken backend server.

Syntax ProxyHTMLMeta [On|Off]

Parses <meta http-equiv ...> elements to real HTTP headers.

-

In version 3, this is also tied in with the improved
+

In version 3, this is also tied in with the improved internationalisation support, and is required to support some character encodings.

@@ -187,8 +194,8 @@ NOT in total), it will be more efficient to set a larger buffer size and avoid the need to resize the buffer dynamically during a request.

-
ProxyHTMLEvents
-
+
ProxyHTMLEvents
+

Syntax ProxyHTMLEvents attr [attr ...]

Specifies one or more attributes to treat as scripting events and apply URLMaps to where appropriate. You can specify any number of
@@ -196,8 +203,8 @@ attributes in one or more ProxyHTMLEvents directives. The sample configuration defines the events in standard HTML 4 and XHTML 1.

-
ProxyHTMLLinks
-
+
ProxyHTMLLinks
+

Syntax ProxyHTMLLinks elt attr [attr ...]

Specifies elements that have URL attributes that should be rewritten using standard URLMaps as in versions 1 and 2 of mod_proxy_html.
@@ -206,28 +213,8 @@ but it can have any number of attributes. The sample configuration defines the HTML links for standard HTML 4 and XHTML 1.

-
ProxyHTMLCharsetAlias
-
-

Syntax ProxyHTMLCharsetAlias charset alias [alias ...]

-

This server-wide directive aliases one or more charset to another
-charset. This enables encodings not recognised by libxml2 to be handled
-internally by libxml2's charset support using the translation table for
-a recognised charset.

-

For example, Latin 1 (ISO-8859-1) is supported by libxml2.
-Microsoft's Windows-1252 is almost identical and can be supported
-by aliasing it:
-ProxyHTMLCharsetAlias ISO-8859-1 Windows-1252

-
-
ProxyHTMLCharsetDefault
-
-

Syntax ProxyHTMLCharsetDefault name

-

This defines the default encoding to assume when absolutely no charset
-information is available from the backend server. The default value for
-this is ISO-8859-1, as specified in HTTP/1.0 and assumed in
-earlier mod_proxy_html versions.

-
-
ProxyHTMLCharsetOut
-
+
ProxyHTMLCharsetOut
+

Syntax ProxyHTMLCharsetOut name

This selects an encoding for mod_proxy_html output. It should not normally be used, as any change from the default UTF-8
@@ -235,20 +222,24 @@ normally be used, as any change from the default UTF-8
processing overhead. The special token ProxyHTMLCharsetOut * will generate output using the same encoding as the input.
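The two forms of this directive described above might be written as follows. This is only a sketch: the choice of ISO-8859-1 is an arbitrary example, and the two lines are alternatives rather than a pair to be used together.

```apache
# Alternative 1: force a fixed output encoding (adds processing overhead).
ProxyHTMLCharsetOut ISO-8859-1

# Alternative 2: emit output in the same encoding as the input.
ProxyHTMLCharsetOut *
```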

-
ProxyHTMLStartParse
-
-

Syntax ProxyHTMLStartParse element [elt*]

-

Specify that the HTML parser should start at the first instance
-of any of the elements specified. This can be used where a broken
-backend inserts leading junk that messes up the parser (example here).

-
+
ProxyHTMLCharsetAlias
+
ProxyHTMLCharsetDefault
+
ProxyHTMLStartParse
+

These directives from Version 3.0 are replaced by
+mod_xml2enc.
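A configuration that used the three removed directives might be migrated roughly as below. This is a sketch under the assumption that mod_xml2enc is loaded and provides the directives xml2EncAlias, xml2EncDefault and xml2StartParse; check the mod_xml2enc documentation for the exact syntax and permitted contexts. The element name html is an arbitrary example.

```apache
# Was: ProxyHTMLCharsetAlias ISO-8859-1 Windows-1252
xml2EncAlias   ISO-8859-1 Windows-1252

# Was: ProxyHTMLCharsetDefault ISO-8859-1
xml2EncDefault ISO-8859-1

# Was: ProxyHTMLStartParse html
xml2StartParse html
```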

Other Configuration

-

Normally, mod_proxy_html will refuse to run when not
+

Normally, mod_proxy_html will refuse to run when not in a proxy or when the contents are not HTML. This can be overridden (at your own risk) by setting the environment variable PROXY_HTML_FORCE (e.g. with the SetEnv directive).
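The override mentioned above could be set per scope as in this sketch. The /legacy/ path is hypothetical, and the value 1 is an assumption: the text only requires that the variable be set.

```apache
# Hypothetical scope in which the filter is forced to run
# even though the usual proxy/HTML checks would skip it.
<Location /legacy/>
    ProxyHTMLEnable On
    SetEnv PROXY_HTML_FORCE 1
</Location>
```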

-
+
diff --git a/faq.html b/faq.html
index 86cfe03..ced1feb 100644
--- a/faq.html
+++ b/faq.html
@@ -22,17 +22,18 @@ Version 2, and most of the questions are moot in Version 3.

Answers

Can mod_proxy_html support (charset XYZ) as input?
-

That depends entirely on libxml2. mod_proxy_html supports
-charset detection, but does not itself support any charsets.
-It works by passing the charset detected to libxml2 when it sets
-up the parser.

-

This means that mod_proxy_html inherits its charset support
-from libxml2, and will always support exactly the same
-charsets available in the version of libxml2 you have installed.
-So bug the libxml2 folks, not us!

-

In Version 3, charset support is much expanded provided
-ProxyHTMLMeta is enabled, and any charset can be supported
-by aliasing it with ProxyHTMLCharsetAlias.

+

In version 2, that depends entirely on libxml2, and your charset
+is supported if and only if libxml2 supports it.

+

In Version 3.1, charset support is much expanded provided
+mod_xml2enc is enabled. It is normally
+sufficient just to load mod_xml2enc: it will be configured automatically
+if you configure mod_proxy_html using ProxyHTMLEnable.
+In a few cases, you may need to customise charset support further using
+mod_xml2enc's directives.

+

Note that some servers send inconsistent and even conflicting charset
+information, and may generate unexpected results. Setting
+ProxyHTMLMeta On may help resolve such cases, and will
+help diagnose problems with extra debug information in the error log.

Can mod_proxy_html support (charset XYZ) as output?

libxml2 uses utf-8 internally for everything.
@@ -40,8 +41,9 @@ Generating output with another charset is therefore an additional overhead, and the decision was taken to exclude any such capability from mod_proxy_html. There is an easy workaround: you can transcode the output using another filter, such as mod_charset_lite.

-

Version 3 supports output transformation to other
-charsets using ProxyHTMLCharsetOut.

+

mod_proxy_html 3 supports output transformation to other
+charsets using ProxyHTMLCharsetOut. This requires
+mod_xml2enc to be loaded.

Why does mod_proxy_html mangle my Javascript?

It doesn't. Your javascript is simply too badly malformed,
@@ -53,22 +55,37 @@ or with libxml2's xmllint --html

The best fix for this is to remove the javascript from your markup, and import it from a separate .js file. If you have an irredeemably broken publishing system, you may have to upgrade to
-mod_publisher or resort to the
-non-markup-aware mod_line_edit.

+mod_publisher or resort to a markup-blind
+filter such as mod_line_edit,
+mod_substitute or mod_sed.

Why doesn't mod_proxy_html rewrite urls in [some attribute]?
-

mod_proxy_html is based on W3C HTML 4.01 and XHTML 1.0 (which are
-identical in terms of elements and attributes). It supports all links
+

mod_proxy_html versions 1 and 2 are based on W3C HTML 4.01 and
+XHTML 1.0 (which are identical in terms of elements and attributes).
+It supports all links
defined in W3C HTML, even those that have been deprecated since 1997. But it does NOT support proprietary pseudo-HTML "extensions" that have never been part of any published HTML standard. Of course, it's trivial to add them to the source.

This has been the most commonly requested feature since mod_proxy_html 2.0
-was released in 2004. It cannot reasonably be satisfied, because everyone's
-pet "extensions" are different. Version 3 deals with this
-by taking all HTML knowledge out of the code and loading it from httpd.conf
-instead, so admins can meet their own needs without recompiling.

+was released in 2004. Since everyone's requirements are different, it
+could not reasonably be satisfied with a simple one-size-fits-all fix.
+Version 3 of mod_proxy_html delegates the definition of HTML links to
+the system administrator, via the configuration file.
+

A sample file proxy_html.conf is provided, and defines
+standard W3C HTML/XHTML. Note that you MUST include this (or equivalent)
+into your configuration, or no links will be rewritten! If you need to
+support nonstandard HTML variants, follow the instructions in
+proxy_html.conf.
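Loading the sample definitions and extending them might look like the sketch below. The Include path is an assumption that depends on how the package was installed, and the element x-widget with its data-src attribute and the onswipe event are invented purely to illustrate a nonstandard extension.

```apache
# Hypothetical location of the sample file shipped with the module.
Include conf/proxy_html.conf

# Definitions in the same style as the sample file:
# an element name followed by its URL-bearing attributes.
ProxyHTMLLinks a   href
ProxyHTMLLinks img src longdesc usemap

# Invented nonstandard element and event, for illustration only.
ProxyHTMLLinks  x-widget data-src
ProxyHTMLEvents onswipe
```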

-
+
diff --git a/guide.html b/guide.html
index 017c4eb..4f86c62 100644
--- a/guide.html
+++ b/guide.html
@@ -11,8 +11,7 @@

mod_proxy_html: Technical Guide

-

mod_proxy_html From Version 2.4 (Sept 2004).
-Updates in Version 3 (Dec. 2006) are highlighted.

+

mod_proxy_html Version 3.1 (April 2009).

Contents

-
+
diff --git a/mod_proxy_html.c b/mod_proxy_html.c
index 0157c2f..6a97d3e 100644
--- a/mod_proxy_html.c
+++ b/mod_proxy_html.c
@@ -1,5 +1,5 @@
 /********************************************************************
-  Copyright (c) 2003-8, WebThing Ltd
+  Copyright (c) 2003-9, WebThing Ltd
   Author: Nick Kew
 
 This program is free software; you can redistribute it and/or modify
@@ -17,22 +17,11 @@ http://apache.webthing.com/COPYING.txt
 
 *********************************************************************/
 
-
-/********************************************************************
-  Note to Users
-
-  You are requested to register as a user, at
-  http://apache.webthing.com/registration.html
-
-  This entitles you to support from the developer.
-  I'm unlikely to reply to help/support requests from
-  non-registered users, unless you're paying and/or offering
-  constructive feedback such as bug reports or sensible
-  suggestions for further development.
-
-  It also makes a small contribution to the effort
-  that's gone into developing this work.
-*********************************************************************/
+/**** NOTICE TO PACKAGERS
+ *
+ * This module now relies on mod_xml2enc for i18n support.
+ * You should make mod_xml2enc a dependency in your packages.
+ */
 
 /* End of Notices */
 
@@ -57,7 +46,8 @@ http://apache.webthing.com/COPYING.txt
 #define VERBOSEB(x) if (verbose) {x}
 #endif
 
-#define VERSION_STRING "proxy_html/3.0.1"
+/* 3.1.2 - trivial changes to fix compile on Windows */
+#define VERSION_STRING "proxy_html/3.1.2"
 
 #include
 
@@ -70,7 +60,11 @@ http://apache.webthing.com/COPYING.txt
 #include
 #include
 #include
-#include
+#include
+
+#include
+#include
+#include
 
 /* To support Apache 2.1/2.2, we need the ap_ forms of the
  * regexp stuff, and they're now used in the code.
@@ -91,6 +85,12 @@ http://apache.webthing.com/COPYING.txt
 #define APACHE22
 #endif
 
+/* globals set once at startup */
+static ap_regex_t* seek_meta ;
+static const apr_strmatch_pattern* seek_content ;
+static apr_status_t (*xml2enc_charset)(request_rec*, xmlCharEncoding*, const char**) = NULL;
+static apr_status_t (*xml2enc_filter)(request_rec*, const char*, unsigned int) = NULL;
+
 module AP_MODULE_DECLARE_DATA proxy_html_module ;
 
 #define M_HTML 0x01
@@ -135,23 +135,17 @@ typedef struct {
   size_t bufsz ;
   apr_hash_t* links;
   apr_array_header_t* events;
-  apr_array_header_t* skipto;
-  xmlCharEncoding default_encoding;
   const char* charset_out;
   int extfix ;
   int metafix ;
   int strip_comments ;
   int interp;
+  int enabled;
 #ifndef GO_FASTER
   int verbose ;
 #endif
 } proxy_html_conf ;
 typedef struct {
-  apr_xlate_t* convset;
-  char* buf;
-  apr_size_t bytes;
-} conv_t;
-typedef struct {
   ap_filter_t* f ;
   proxy_html_conf* cfg ;
   htmlParserCtxtPtr parser ;
@@ -159,8 +153,6 @@ typedef struct {
   char* buf ;
   size_t offset ;
   size_t avail ;
-  conv_t* conv_in;
-  conv_t* conv_out;
   const char* encoding;
   urlmap* map;
 } saxctxt ;
@@ -195,132 +187,15 @@ static void normalise(unsigned int flags, char* str) {
       *p = tolower(*p) ;
 
   if ( flags & NORM_MSSLASH )
-    for ( p = ap_strchr_c(str, '\\') ; p ; p = ap_strchr_c(p+1, '\\') )
+    for ( p = ap_strchr(str, '\\') ; p ; p = ap_strchr(p+1, '\\') )
       *p = '/' ;
 
 }
-static void consume_buffer(saxctxt* ctx, const char* inbuf,
-                int bytes, int flag) {
-  apr_status_t rv;
-  apr_size_t insz;
-  char* buf;
-#ifndef GO_FASTER
-  int verbose = ctx->cfg->verbose;
-#endif
-  if (ctx->conv_in == NULL) {
-    /* just feed it to libxml2 */
-    htmlParseChunk(ctx->parser, inbuf, bytes, flag) ;
-    return;
-  }
-  if (ctx->conv_in->bytes > 0) {
-    /* FIXME: make this a reusable buf? */
-    buf = apr_palloc(ctx->f->r->pool, ctx->conv_in->bytes + bytes);
-    memcpy(buf, ctx->conv_in->buf, ctx->conv_in->bytes);
-    memcpy(buf + ctx->conv_in->bytes, inbuf, bytes);
-    bytes += ctx->conv_in->bytes;
-    ctx->conv_in->bytes = 0;
-  } else {
-    buf = (char*) inbuf;
-  }
-  insz = bytes;
-  while (insz > 0) {
-    char outbuf[4096];
-    apr_size_t outsz = 4096;
-    rv = apr_xlate_conv_buffer(ctx->conv_in->convset,
-                               buf + (bytes - insz), &insz,
-                               outbuf, &outsz);
-    htmlParseChunk(ctx->parser, outbuf, 4096-outsz, flag) ;
-    switch (rv) {
-      case APR_SUCCESS:
-        continue;
-      case APR_EINCOMPLETE:
-        if (insz < 32) {/* save dangling byte(s) and return */
-          ctx->conv_in->bytes = insz;
-          ctx->conv_in->buf = (buf != inbuf) ? buf + (bytes-insz)
-                : apr_pmemdup(ctx->f->r->pool, buf + (bytes-insz), insz);
-          return;
-        } else { /*OK, maybe 4096 wasn't big enough, and ended mid-char */
-          continue;
-        }
-      case APR_EINVAL: /* try skipping one bad byte */
-        VERBOSE( ap_log_rerror(APLOG_MARK, APLOG_ERR, 0, ctx->f->r,
-                "Skipping invalid byte in input stream!") ) ;
-        --insz;
-        continue;
-      default:
-        /* Erk! What's this? Bail out and eat the buf raw
-         * if libxml2 will accept it!
-         */
-        ap_log_rerror(APLOG_MARK, APLOG_ERR, rv, ctx->f->r,
-                "Failed to convert input; trying it raw") ;
-        htmlParseChunk(ctx->parser, buf + (bytes - insz), insz, flag) ;
-        ctx->conv_in = NULL; /* don't try converting any more */
-        return;
-    }
-  }
-}
-static void AP_fwrite(saxctxt* ctx, const char* inbuf, int bytes, int flush) {
-  /* convert charset if necessary, and output */
-  char* buf;
-  apr_status_t rv;
-  apr_size_t insz ;
-#ifndef GO_FASTER
-  int verbose = ctx->cfg->verbose;
-#endif
+#define consume_buffer(ctx,inbuf,bytes,flag) \
+        htmlParseChunk(ctx->parser, inbuf, bytes, flag)
 
-  if (ctx->conv_out == NULL) {
-    ap_fwrite(ctx->f->next, ctx->bb, inbuf, bytes);
-    return;
-  }
-  if (ctx->conv_out->bytes > 0) {
-    /* FIXME: make this a reusable buf? */
-    buf = apr_palloc(ctx->f->r->pool, ctx->conv_out->bytes + bytes);
-    memcpy(buf, ctx->conv_out->buf, ctx->conv_out->bytes);
-    memcpy(buf + ctx->conv_out->bytes, inbuf, bytes);
-    bytes += ctx->conv_out->bytes;
-    ctx->conv_out->bytes = 0;
-  } else {
-    buf = (char*) inbuf;
-  }
-  insz = bytes;
-  while (insz > 0) {
-    char outbuf[2048];
-    apr_size_t outsz = 2048;
-    rv = apr_xlate_conv_buffer(ctx->conv_out->convset,
-                               buf + (bytes - insz), &insz,
-                               outbuf, &outsz);
-    ap_fwrite(ctx->f->next, ctx->bb, outbuf, 2048-outsz) ;
-    switch (rv) {
-      case APR_SUCCESS:
-        continue;
-      case APR_EINCOMPLETE: /* save dangling byte(s) and return */
-        /* but if we need to flush, just abandon them */
-        if ( flush) { /* if we're flushing, this must be complete */
-          /* so this is an error */
-          VERBOSE( ap_log_rerror(APLOG_MARK, APLOG_ERR, 0, ctx->f->r,
-                "Skipping invalid byte in output stream!") ) ;
-        } else {
-          ctx->conv_out->bytes = insz;
-          ctx->conv_out->buf = (buf != inbuf) ? buf + (bytes-insz)
-                : apr_pmemdup(ctx->f->r->pool, buf + (bytes-insz), insz);
-        }
-        break;
-      case APR_EINVAL: /* try skipping one bad byte */
-        VERBOSE( ap_log_rerror(APLOG_MARK, APLOG_ERR, 0, ctx->f->r,
-                "Skipping invalid byte in output stream!") ) ;
-        --insz;
-        continue;
-      default:
-        /* Erk! What's this? Bail out and pass the buf raw
-         * if libxml2 will accept it!
-         */
-        VERBOSE( ap_log_rerror(APLOG_MARK, APLOG_ERR, rv, ctx->f->r,
-                "Failed to convert output; sending UTF-8") ) ;
-        ap_fwrite(ctx->f->next, ctx->bb, buf + (bytes - insz), insz) ;
-        break;
-    }
-  }
-}
+#define AP_fwrite(ctx,inbuf,bytes,flush) \
+        ap_fwrite(ctx->f->next, ctx->bb, inbuf, bytes);
 
 /* This is always utf-8 on entry. We can convert charset within FLUSH */
 #define FLUSH AP_fwrite(ctx, (chars+begin), (i-begin), 0) ; begin = i+1
@@ -350,9 +225,9 @@ static void preserve(saxctxt* ctx, const size_t len) {
   newbuf = realloc(ctx->buf, ctx->avail) ;
   if ( newbuf != ctx->buf ) {
     if ( ctx->buf )
-      apr_pool_cleanup_kill(ctx->f->r->pool, ctx->buf, (void*)free) ;
+      apr_pool_cleanup_kill(ctx->f->r->pool, ctx->buf, (int(*)(void*))free);
     apr_pool_cleanup_register(ctx->f->r->pool, newbuf,
-        (void*)free, apr_pool_cleanup_null);
+        (int(*)(void*))free, apr_pool_cleanup_null);
     ctx->buf = newbuf ;
   }
 }
@@ -614,7 +489,7 @@ static void pstartElement(void* ctxt, const xmlChar* uname,
             ++num_match ;
             offs = match = pmatch[0].rm_so ;
             s_from = pmatch[0].rm_eo - match ;
-            subs = ap_pregsub(ctx->f->r->pool, m->to, ctx->buf+offs,
+            subs = ap_pregsub(ctx->f->r->pool, m->to, ctx->buf,
                 nmatch, pmatch) ;
             VERBOSE( {
               const char* f = apr_pstrndup(ctx->f->r->pool,
@@ -765,100 +640,6 @@ static void pstartElement(void* ctxt, const xmlChar* uname,
   }
 }
 
-/* globals set once at startup */
-static ap_regex_t* seek_meta_ctype ;
-static ap_regex_t* seek_charset ;
-static ap_regex_t* seek_meta ;
-
-static xmlCharEncoding sniff_encoding(saxctxt* ctx, const char* cbuf,
-                size_t bytes) {
-#ifndef GO_FASTER
-  int verbose = ctx->cfg->verbose;
-#endif
-  request_rec* r = ctx->f->r ;
-  proxy_html_conf* cfg = ctx->cfg ;
-  xmlCharEncoding ret ;
-  char* p ;
-  ap_regmatch_t match[2] ;
-  char* buf = (char*)cbuf ;
-  apr_xlate_t* convset;
-
-  VERBOSE( ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r,
-                "Content-Type is %s", r->content_type) ) ;
-
-/* If we've got it in the HTTP headers, there's nothing to do */
-  if ( r->content_type &&
-        ( p = ap_strcasestr(r->content_type, "charset=") , p > 0 ) ) {
-    p += 8 ;
-    if ( ctx->encoding = apr_pstrndup(r->pool, p, strcspn(p, " ;") ) ,
-        ctx->encoding ) {
-      VERBOSE( ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r,
-                "Got charset %s from HTTP headers", ctx->encoding) ) ;
-      if ( ret = xmlParseCharEncoding(ctx->encoding),
-                ((ret != XML_CHAR_ENCODING_ERROR )
-                 && (ret != XML_CHAR_ENCODING_NONE))) {
-        return ret ;
-      }
-    }
-  }
-
-/* to sniff, first we look for BOM */
-  if (ctx->encoding == NULL) {
-    if ( ret = xmlDetectCharEncoding((const xmlChar*)buf, bytes),
-        ret != XML_CHAR_ENCODING_NONE ) {
-      VERBOSE( ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r,
-                "Got charset from XML rules.") ) ;
-      return ret ;
-    }
-
-/* If none of the above, look for a META-thingey */
-    if ( ap_regexec(seek_meta_ctype, buf, 1, match, 0) == 0 ) {
-      p = apr_pstrndup(r->pool, buf + match[0].rm_so,
-                match[0].rm_eo - match[0].rm_so) ;
-      if ( ap_regexec(seek_charset, p, 2, match, 0) == 0 )
-        ctx->encoding = apr_pstrndup(r->pool, p+match[1].rm_so,
-                match[1].rm_eo - match[1].rm_so) ;
-    }
-  }
-
-/* either it's set to something we found or it's still the default */
-  if ( ctx->encoding ) {
-    VERBOSE( ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r,
-                "Got charset %s from HTML META", ctx->encoding) ) ;
-    if ( ret = xmlParseCharEncoding(ctx->encoding),
-                ((ret != XML_CHAR_ENCODING_ERROR )
-                 && (ret != XML_CHAR_ENCODING_NONE))) {
-      return ret ;
-    }
-/* Unsupported charset. Can we get (iconv) support through apr_xlate? */
-/* Aaargh! libxml2 has undocumented support. So this fails
- * if metafix is not active. Have to make it conditional.
- */
-    if (cfg->metafix) {
-      VERBOSE( ap_log_rerror(APLOG_MARK, APLOG_DEBUG, 0, r,
-                "Charset %s not supported by libxml2; trying apr_xlate", ctx->encoding) ) ;
-      if (apr_xlate_open(&convset, "UTF-8", ctx->encoding, r->pool) == APR_SUCCESS) {
-        ctx->conv_in = apr_pcalloc(r->pool, sizeof(conv_t));
-        ctx->conv_in->convset = convset ;
-        return XML_CHAR_ENCODING_UTF8 ;
-      } else {
-        ap_log_rerror(APLOG_MARK, APLOG_ERR, 0, r,
-                "Charset %s not supported. Consider aliasing it?", ctx->encoding) ;
-      }
-    } else {
-      ap_log_rerror(APLOG_MARK, APLOG_ERR, 0, r,
-                "Charset %s not supported. Consider aliasing it or use metafix?",
-                ctx->encoding) ;
-    }
-  }
-
-
-/* Use configuration default as a last resort */
-  ap_log_rerror(APLOG_MARK, APLOG_WARNING, 0, r,
-                "No usable charset information; using configuration default") ;
-  return (cfg->default_encoding == XML_CHAR_ENCODING_NONE)
-        ? XML_CHAR_ENCODING_8859_1 : cfg->default_encoding ;
-}
 static meta* metafix(request_rec* r, const char* buf /*, size_t bytes*/
 #ifndef GO_FASTER
                 , int verbose
@@ -882,21 +663,26 @@ static meta* metafix(request_rec* r, const char* buf /*, size_t bytes*/
     header = apr_pstrndup(r->pool, p, q-p) ;
     if ( strncasecmp(header, "Content-", 8) ) {
 /* find content=... string */
-      for ( p = ap_strstr((char*)buf+offs+pmatch[0].rm_so, "content") ; *p ; ) {
-        p += 7 ;
-        while ( *p && isspace(*p) )
-          ++p ;
-        if ( *p != '=' )
-          continue ;
-        while ( *p && isspace(*++p) ) ;
-        if ( ( *p == '\'' ) || ( *p == '"' ) ) {
-          delim = *p++ ;
-          for ( q = p ; *q != delim ; ++q ) ;
-        } else {
-          for ( q = p ; *q && !isspace(*q) && (*q != '>') ; ++q ) ;
-        }
-        content = apr_pstrndup(r->pool, p, q-p) ;
-        break ;
+      p = apr_strmatch(seek_content, buf+offs+pmatch[0].rm_so,
+                       pmatch[0].rm_eo - pmatch[0].rm_so);
+      /* if it doesn't contain "content", ignore, don't crash! */
+      if (p != NULL) {
+        while (*p) {
+          p += 7 ;
+          while ( *p && isspace(*p) )
+            ++p ;
+          if ( *p != '=' )
+            continue ;
+          while ( *p && isspace(*++p) ) ;
+          if ( ( *p == '\'' ) || ( *p == '"' ) ) {
+            delim = *p++ ;
+            for ( q = p ; *q != delim ; ++q ) ;
+          } else {
+            for ( q = p ; *q && !isspace(*q) && (*q != '>') ; ++q ) ;
+          }
+          content = apr_pstrndup(r->pool, p, q-p) ;
+          break ;
+        }
       }
     } else if ( !strncasecmp(header, "Content-Type", 12) ) {
       ret = apr_palloc(r->pool, sizeof(meta) ) ;
@@ -938,11 +724,12 @@ static const char* interpolate_vars(request_rec* r, const char* str) {
       var = apr_pstrndup(r->pool, start+2, end-start-2) ;
     }
     replacement = apr_table_get(r->subprocess_env, var) ;
-    if (!replacement)
+    if (!replacement) {
       if (delim)
         replacement = apr_pstrndup(r->pool, delim+1, end-delim-1);
       else
         replacement = "";
+    }
     str = apr_pstrcat(r->pool, before, replacement, after, NULL);
     ap_log_rerror(APLOG_MARK, APLOG_DEBUG, 0, r,
                 "Interpolating %s => %s", var, replacement) ;
@@ -1033,7 +820,7 @@ static saxctxt* check_filter_init (ap_filter_t* f) {
   if ( errmsg ) {
 #ifndef GO_FASTER
     if ( cfg->verbose ) {
-      ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, f->r, errmsg) ;
+      ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, f->r, "%s", errmsg) ;
     }
 #endif
     ap_remove_output_filter(f) ;
@@ -1057,8 +844,6 @@ static saxctxt* check_filter_init (ap_filter_t* f) {
   return f->ctx ;
 }
 static int proxy_html_filter(ap_filter_t* f, apr_bucket_brigade* bb) {
-  apr_xlate_t* convset;
-  const char* charset = NULL;
   apr_bucket* b ;
   meta* m = NULL ;
   xmlCharEncoding enc ;
@@ -1101,69 +886,28 @@ static int proxy_html_filter(ap_filter_t* f, apr_bucket_brigade* bb) {
     } else if ( apr_bucket_read(b, &buf, &bytes, APR_BLOCK_READ)
                 == APR_SUCCESS ) {
       if ( ctxt->parser == NULL ) {
-        if ( buf[bytes] != 0 ) {
-          /* make a string for parse routines to play with */
-          char* buf1 = apr_palloc(f->r->pool, bytes+1) ;
-          memcpy(buf1, buf, bytes) ;
-          buf1[bytes] = 0 ;
-          buf = buf1 ;
-        }
-        /* For publishing systems that insert crap at the head of a
-         * page that buggers up the parser. Search to first instance
-         * of some relatively sane, or at least parseable, element.
-         */
-        if (ctxt->cfg->skipto != NULL) {
-          char* p = ap_strchr_c(buf, '<');
-          tattr* starts = (tattr*) ctxt->cfg->skipto->elts;
-          int found = 0;
-          while (!found && *p) {
-            int i;
-            for (i = 0; i < ctxt->cfg->skipto->nelts; ++i) {
-              if ( !strncasecmp(p+1, starts[i].val, strlen(starts[i].val))) {
-                bytes -= (p-buf);
-                buf = p ;
-                found = 1;
-                VERBOSE(
-                  ap_log_rerror(APLOG_MARK, APLOG_DEBUG, 0, f->r,
-                        "Skipped to first <%s> element", starts[i].val)
-                ) ;
-                break;
-              }
-            }
-            p = ap_strchr_c(p+1, '<');
-          }
-          if (p == NULL) {
-            ap_log_rerror(APLOG_MARK, APLOG_WARNING, 0, f->r,
-                "Failed to find start of recognised HTML!") ;
-          }
-        }
-
-        enc = sniff_encoding(ctxt, buf, bytes) ;
-        /* now we have input charset, set output charset too */
-        if (ctxt->cfg->charset_out) {
-          if (!strcmp(ctxt->cfg->charset_out, "*"))
-            charset = ctxt->encoding;
-          else
-            charset = ctxt->cfg->charset_out;
-          if (strcasecmp(charset, "utf-8")) {
-            if (apr_xlate_open(&convset, charset, "UTF-8",
-                f->r->pool) == APR_SUCCESS) {
-              ctxt->conv_out = apr_pcalloc(f->r->pool, sizeof(conv_t));
-              ctxt->conv_out->convset = convset;
-            } else {
-              ap_log_rerror(APLOG_MARK, APLOG_WARNING, 0, f->r,
-                "Output charset %s not supported. Falling back to UTF-8",
-                charset) ;
-            }
-          }
-        }
-        if (ctxt->conv_out) {
-          const char* ctype = apr_psprintf(f->r->pool,
-                "text/html;charset=%s", charset);
-          ap_set_content_type(f->r, ctype) ;
-        } else {
+        const char* cenc;
+        if (!xml2enc_charset ||
+            (xml2enc_charset(f->r, &enc, &cenc) != APR_SUCCESS)) {
+          if (!xml2enc_charset)
+            ap_log_rerror(APLOG_MARK, APLOG_WARNING, 0, f->r,
+                "No i18n support found. Install mod_xml2enc if required") ;
+          enc = XML_CHAR_ENCODING_NONE;
           ap_set_content_type(f->r, "text/html;charset=utf-8") ;
+        } else {
+          /* if we wanted a non-default charset_out, insert the
+           * xml2enc filter now that we've sniffed it
+           */
+          if (ctxt->cfg->charset_out && xml2enc_filter) {
+            if (*ctxt->cfg->charset_out != '*')
+              cenc = ctxt->cfg->charset_out;
+            xml2enc_filter(f->r, cenc, ENCIO_OUTPUT);
+            ap_set_content_type(f->r,
+                apr_pstrcat(f->r->pool, "text/html;charset=", cenc, NULL)) ;
+          } else /* Normal case, everything worked, utf-8 output */
+            ap_set_content_type(f->r, "text/html;charset=utf-8") ;
         }
+
         ap_fputs(f->next, ctxt->bb, ctxt->cfg->doctype) ;
         ctxt->parser = htmlCreatePushParserCtxt(&sax, ctxt, buf, 4, 0, enc) ;
         buf += 4;
@@ -1174,7 +918,7 @@ static int proxy_html_filter(ap_filter_t* f, apr_bucket_brigade* bb) {
           return rv;
         }
         apr_pool_cleanup_register(f->r->pool, ctxt->parser,
-                (void*)htmlFreeParserCtxt, apr_pool_cleanup_null) ;
+                (int(*)(void*))htmlFreeParserCtxt, apr_pool_cleanup_null) ;
 #ifndef USE_OLD_LIBXML2
         if ( xmlopts = xmlCtxtUseOptions(ctxt->parser, xmlopts ), xmlopts )
           ap_log_rerror(APLOG_MARK, APLOG_WARNING, 0, f->r,
@@ -1209,7 +953,6 @@ static void* proxy_html_config(apr_pool_t* pool, char* x) {
   ret->doctype = DEFAULT_DOCTYPE ;
   ret->etag = DEFAULT_ETAG ;
   ret->bufsz = 8192 ;
-  ret->default_encoding = XML_CHAR_ENCODING_NONE ;
   /* ret->interp = 1; */
   /* don't initialise links and events until they get set/used */
   return ret ;
@@ -1223,8 +966,6 @@ static void* proxy_html_merge(apr_pool_t* pool, void* BASE, void* ADD) {
   conf->links = (add->links == NULL) ? base->links : add->links;
   conf->events = (add->events == NULL) ? base->events : add->events;
 
-  conf->default_encoding = (add->default_encoding == XML_CHAR_ENCODING_NONE)
-        ? base->default_encoding : add->default_encoding ;
   conf->charset_out = (add->charset_out == NULL)
         ? base->charset_out : add->charset_out ;
 
@@ -1254,7 +995,7 @@ static void* proxy_html_merge(apr_pool_t* pool, void* BASE, void* ADD) {
     conf->extfix = add->extfix ;
     conf->interp = add->interp ;
     conf->strip_comments = add->strip_comments ;
-    conf->skipto = add->skipto ;
+    conf->enabled = add->enabled;
 #ifndef GO_FASTER
     conf->verbose = add->verbose ;
 #endif
@@ -1264,7 +1005,7 @@ static void* proxy_html_merge(apr_pool_t* pool, void* BASE, void* ADD) {
     conf->extfix = base->extfix | add->extfix ;
     conf->interp = base->interp | add->interp ;
     conf->strip_comments = base->strip_comments | add->strip_comments ;
-    conf->skipto = add->skipto ? add->skipto : base->skipto ;
+    conf->enabled = add->enabled | base->enabled;
 #ifndef GO_FASTER
     conf->verbose = base->verbose | add->verbose ;
 #endif
@@ -1303,16 +1044,17 @@ static void comp_urlmap(apr_pool_t* pool, urlmap* newmap,
     newmap->to = to ;
   }
   if (cond != NULL) {
+    char* cond_copy;
     newmap->cond = apr_pcalloc(pool, sizeof(rewritecond));
     if (cond[0] == '!') {
       newmap->cond->rel = -1;
-      newmap->cond->env = cond+1;
+      newmap->cond->env = cond_copy = apr_pstrdup(pool, cond+1);
     } else {
       newmap->cond->rel = 1;
-      newmap->cond->env = cond;
+      newmap->cond->env = cond_copy = apr_pstrdup(pool, cond);
     }
-    eq = ap_strchr_c(++cond, '=');
-    if (eq && (eq != cond)) {
+    eq = ap_strchr(++cond_copy, '=');
+    if (eq) {
       *eq = 0;
       newmap->cond->val = eq+1;
     }
@@ -1400,15 +1142,6 @@ static const char* set_events(cmd_parms* cmd, void* CFG, const char* arg) {
   attr->val = arg;
   return NULL ;
 }
-static const char* set_skipto(cmd_parms* cmd, void* CFG, const char* arg) {
-  tattr* attr;
-  proxy_html_conf* cfg = CFG;
-  if (cfg->skipto == NULL)
-    cfg->skipto = apr_array_make(cmd->pool, 4, sizeof(tattr));
-  attr = apr_array_push(cfg->skipto) ;
-  attr->val = arg;
-  return NULL ;
-}
 static const char* set_links(cmd_parms* cmd, void* CFG,
         const char* elt, const char* att) {
   apr_array_header_t* attrs;
@@ -1427,33 +1160,7 @@ static const char* set_links(cmd_parms* cmd, void* CFG,
   attr->val = att ;
   return NULL ;
 }
-static const char* set_charset_alias(cmd_parms* cmd, void* CFG,
-        const char* charset, const char* alias) {
-  const char* errmsg = ap_check_cmd_context(cmd, GLOBAL_ONLY);
-  if (errmsg != NULL)
-    return errmsg ;
-  else if (xmlAddEncodingAlias(charset, alias) == 0)
-    return NULL;
-  else
-    return "Error setting charset alias";
-}
-static const char* set_charset_default(cmd_parms* cmd, void* CFG,
-        const char* charset) {
-  proxy_html_conf* cfg = CFG;
-  cfg->default_encoding = xmlParseCharEncoding(charset);
-  switch(cfg->default_encoding) {
-    case XML_CHAR_ENCODING_NONE:
-      return "Default charset not found";
-    case XML_CHAR_ENCODING_ERROR:
-      return "Invalid or unsupported default charset";
-    default:
-      return NULL;
-  }
-}
 static const command_rec proxy_html_cmds[] = {
-  AP_INIT_ITERATE("ProxyHTMLStartParse", set_skipto, NULL,
-        RSRC_CONF|ACCESS_CONF,
-        "Ignore anything in front of the first of these elements"),
   AP_INIT_ITERATE("ProxyHTMLEvents", set_events, NULL,
         RSRC_CONF|ACCESS_CONF, "Strings to be treated as scripting events"),
   AP_INIT_ITERATE2("ProxyHTMLLinks", set_links, NULL,
@@ -1485,38 +1192,52 @@ static const command_rec proxy_html_cmds[] = {
   AP_INIT_TAKE1("ProxyHTMLBufSize", ap_set_int_slot,
         (void*)APR_OFFSETOF(proxy_html_conf, bufsz),
         RSRC_CONF|ACCESS_CONF, "Buffer size" ) ,
-  AP_INIT_ITERATE2("ProxyHTMLCharsetAlias", set_charset_alias, NULL,
-        RSRC_CONF, "ProxyHTMLCharsetAlias charset alias [more aliases]" ) ,
-  AP_INIT_TAKE1("ProxyHTMLCharsetDefault", set_charset_default, NULL,
-        RSRC_CONF|ACCESS_CONF, "Usage: ProxyHTMLCharsetDefault charset" ) ,
   AP_INIT_TAKE1("ProxyHTMLCharsetOut", ap_set_string_slot,
         (void*)APR_OFFSETOF(proxy_html_conf, charset_out),
         RSRC_CONF|ACCESS_CONF, "Usage: ProxyHTMLCharsetOut charset" ) ,
+  AP_INIT_FLAG("ProxyHTMLEnable", ap_set_flag_slot,
+        (void*)APR_OFFSETOF(proxy_html_conf, enabled),
+        RSRC_CONF|ACCESS_CONF, "Enable proxy-html and xml2enc filters" ) ,
   { NULL }
 } ;
 static int mod_proxy_html(apr_pool_t* p, apr_pool_t* p1, apr_pool_t* p2,
         server_rec* s) {
   ap_add_version_component(p, VERSION_STRING) ;
-  seek_meta_ctype = ap_pregcomp(p,
-        "(]*http-equiv[ \t\r\n='\"]*content-type[^>]*>)",
-        AP_REG_EXTENDED|AP_REG_ICASE) ;
-  seek_charset = ap_pregcomp(p, "charset=([A-Za-z0-9_-]+)",
-        AP_REG_EXTENDED|AP_REG_ICASE) ;
   seek_meta = ap_pregcomp(p, "]*(http-equiv)[^>]*>",
         AP_REG_EXTENDED|AP_REG_ICASE) ;
+  seek_content = apr_strmatch_precompile(p, "content", 0);
   memset(&sax, 0, sizeof(htmlSAXHandler));
   sax.startElement = pstartElement ;
   sax.endElement = pendElement ;
   sax.characters = pcharacters ;
   sax.comment = pcomment ;
   sax.cdataBlock = pcdata ;
+  xml2enc_charset = APR_RETRIEVE_OPTIONAL_FN(xml2enc_charset);
+  xml2enc_filter = APR_RETRIEVE_OPTIONAL_FN(xml2enc_filter);
+  if (!xml2enc_charset) {
+    ap_log_perror(APLOG_MARK, APLOG_NOTICE, 0, p2,
+        "I18n support in mod_proxy_html requires mod_xml2enc. "
+        "Without it, non-ASCII characters in proxied pages are "
+        "likely to display incorrectly.");
+  }
   return OK ;
 }
+static void proxy_html_insert(request_rec* r) {
+  proxy_html_conf* cfg
+        = ap_get_module_config(r->per_dir_config, &proxy_html_module);
+  if (cfg->enabled) {
+    if (xml2enc_filter)
+      xml2enc_filter(r, NULL, ENCIO_INPUT_CHECKS);
+    ap_add_output_filter("proxy-html", NULL, r, r->connection);
+  }
+}
 static void proxy_html_hooks(apr_pool_t* p) {
+  static const char* aszSucc[] = { "mod_filter.c", NULL };
   ap_register_output_filter_protocol("proxy-html", proxy_html_filter,
         NULL, AP_FTYPE_RESOURCE,
         AP_FILTER_PROTO_CHANGE|AP_FILTER_PROTO_CHANGE_LENGTH) ;
   ap_hook_post_config(mod_proxy_html, NULL, NULL, APR_HOOK_MIDDLE) ;
+  ap_hook_insert_filter(proxy_html_insert, NULL, aszSucc, APR_HOOK_MIDDLE) ;
 }
 module AP_MODULE_DECLARE_DATA proxy_html_module = {
   STANDARD20_MODULE_STUFF,
diff --git a/proxy_html.conf b/proxy_html.conf
index 4e9367e..49afe98 100644
--- a/proxy_html.conf
+++ b/proxy_html.conf
@@ -1,16 +1,20 @@
 # Configuration example.
 #
-# First, to load the module with its prerequisites
+# First, to load the module with its prerequisites. Note: mod_xml2enc
+# is not always necessary, but without it mod_proxy_html is likely to
+# mangle pages in encodings other than ASCII or Unicode (utf-8).
 #
 # For Unix-family systems:
 # LoadFile /usr/lib/libxml2.so
 # LoadModule proxy_html_module modules/mod_proxy_html.so
+# LoadModule xml2enc_module modules/mod_xml2enc.so
 #
 # For Windows (I don't know if there's a standard path for the libraries)
 # LoadFile C:/path/zlib.dll
 # LoadFile C:/path/iconv.dll
 # LoadFile C:/path/libxml2.dll
 # LoadModule proxy_html_module modules/mod_proxy_html.so
+# LoadModule xml2enc_module modules/mod_xml2enc.so
 #
 # All knowledge of HTML links has been removed from the mod_proxy_html
 # code itself, and is instead read from httpd.conf (or included file)
@@ -56,7 +60,26 @@ ProxyHTMLEvents onclick ondblclick onmousedown onmouseup \
 #
 # ProxyHTMLLinks myelement myattr otherattr
 #
-# Also at top level in httpd.conf, you can declare charset aliases.
-# This is the most efficient way to support encodings that libxml2
-# doesn't natively support. See the documentation at
+###########
+# EXAMPLE #
+###########
+#
+# To define the URL /my-gateway/ as a gateway to an appserver with address
+# http://some.app.intranet/ on a private network, after loading the
+# modules and including this configuration file:
+#
+# ProxyRequests Off <-- this is an important security setting
+# ProxyPass /my-gateway/ http://some.app.intranet/
+# <Location /my-gateway/>
+# ProxyPassReverse /
+# ProxyHTMLEnable On
+# ProxyHTMLURLMap http://some.app.intranet/ /my-gateway/
+# ProxyHTMLURLMap / /my-gateway/
+# </Location>
+#
+# Many (though not all) real-life setups are more complex.
+#
+# See the documentation at
 # http://apache.webthing.com/mod_proxy_html/
+# and the tutorial at
+# http://www.apachetutor.org/admin/reverseproxies
-- 
2.11.0