From: Emmanuel Lacour Date: Tue, 16 Oct 2007 09:44:21 +0000 (+0000) Subject: New upstream release (3.0.0) X-Git-Tag: 3.0.0-1~3 X-Git-Url: http://git.home-dn.net/?p=manu%2Fmod-proxy-html.git;a=commitdiff_plain;h=c5bde94cb8c9fdf79fa6ae912c1a8536bf7e800c New upstream release (3.0.0) --- diff --git a/COPYING b/COPYING new file mode 100644 index 0000000..5b6e7c6 --- /dev/null +++ b/COPYING @@ -0,0 +1,340 @@ + GNU GENERAL PUBLIC LICENSE + Version 2, June 1991 + + Copyright (C) 1989, 1991 Free Software Foundation, Inc. + 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + Everyone is permitted to copy and distribute verbatim copies + of this license document, but changing it is not allowed. + + Preamble + + The licenses for most software are designed to take away your +freedom to share and change it. By contrast, the GNU General Public +License is intended to guarantee your freedom to share and change free +software--to make sure the software is free for all its users. This +General Public License applies to most of the Free Software +Foundation's software and to any other program whose authors commit to +using it. (Some other Free Software Foundation software is covered by +the GNU Library General Public License instead.) You can apply it to +your programs, too. + + When we speak of free software, we are referring to freedom, not +price. Our General Public Licenses are designed to make sure that you +have the freedom to distribute copies of free software (and charge for +this service if you wish), that you receive source code or can get it +if you want it, that you can change the software or use pieces of it +in new free programs; and that you know you can do these things. + + To protect your rights, we need to make restrictions that forbid +anyone to deny you these rights or to ask you to surrender the rights. +These restrictions translate to certain responsibilities for you if you +distribute copies of the software, or if you modify it. 
+ + For example, if you distribute copies of such a program, whether +gratis or for a fee, you must give the recipients all the rights that +you have. You must make sure that they, too, receive or can get the +source code. And you must show them these terms so they know their +rights. + + We protect your rights with two steps: (1) copyright the software, and +(2) offer you this license which gives you legal permission to copy, +distribute and/or modify the software. + + Also, for each author's protection and ours, we want to make certain +that everyone understands that there is no warranty for this free +software. If the software is modified by someone else and passed on, we +want its recipients to know that what they have is not the original, so +that any problems introduced by others will not reflect on the original +authors' reputations. + + Finally, any free program is threatened constantly by software +patents. We wish to avoid the danger that redistributors of a free +program will individually obtain patent licenses, in effect making the +program proprietary. To prevent this, we have made it clear that any +patent must be licensed for everyone's free use or not licensed at all. + + The precise terms and conditions for copying, distribution and +modification follow. + + GNU GENERAL PUBLIC LICENSE + TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION + + 0. This License applies to any program or other work which contains +a notice placed by the copyright holder saying it may be distributed +under the terms of this General Public License. The "Program", below, +refers to any such program or work, and a "work based on the Program" +means either the Program or any derivative work under copyright law: +that is to say, a work containing the Program or a portion of it, +either verbatim or with modifications and/or translated into another +language. (Hereinafter, translation is included without limitation in +the term "modification".) 
Each licensee is addressed as "you". + +Activities other than copying, distribution and modification are not +covered by this License; they are outside its scope. The act of +running the Program is not restricted, and the output from the Program +is covered only if its contents constitute a work based on the +Program (independent of having been made by running the Program). +Whether that is true depends on what the Program does. + + 1. You may copy and distribute verbatim copies of the Program's +source code as you receive it, in any medium, provided that you +conspicuously and appropriately publish on each copy an appropriate +copyright notice and disclaimer of warranty; keep intact all the +notices that refer to this License and to the absence of any warranty; +and give any other recipients of the Program a copy of this License +along with the Program. + +You may charge a fee for the physical act of transferring a copy, and +you may at your option offer warranty protection in exchange for a fee. + + 2. You may modify your copy or copies of the Program or any portion +of it, thus forming a work based on the Program, and copy and +distribute such modifications or work under the terms of Section 1 +above, provided that you also meet all of these conditions: + + a) You must cause the modified files to carry prominent notices + stating that you changed the files and the date of any change. + + b) You must cause any work that you distribute or publish, that in + whole or in part contains or is derived from the Program or any + part thereof, to be licensed as a whole at no charge to all third + parties under the terms of this License. 
+ + c) If the modified program normally reads commands interactively + when run, you must cause it, when started running for such + interactive use in the most ordinary way, to print or display an + announcement including an appropriate copyright notice and a + notice that there is no warranty (or else, saying that you provide + a warranty) and that users may redistribute the program under + these conditions, and telling the user how to view a copy of this + License. (Exception: if the Program itself is interactive but + does not normally print such an announcement, your work based on + the Program is not required to print an announcement.) + +These requirements apply to the modified work as a whole. If +identifiable sections of that work are not derived from the Program, +and can be reasonably considered independent and separate works in +themselves, then this License, and its terms, do not apply to those +sections when you distribute them as separate works. But when you +distribute the same sections as part of a whole which is a work based +on the Program, the distribution of the whole must be on the terms of +this License, whose permissions for other licensees extend to the +entire whole, and thus to each and every part regardless of who wrote it. + +Thus, it is not the intent of this section to claim rights or contest +your rights to work written entirely by you; rather, the intent is to +exercise the right to control the distribution of derivative or +collective works based on the Program. + +In addition, mere aggregation of another work not based on the Program +with the Program (or with a work based on the Program) on a volume of +a storage or distribution medium does not bring the other work under +the scope of this License. + + 3. 
You may copy and distribute the Program (or a work based on it, +under Section 2) in object code or executable form under the terms of +Sections 1 and 2 above provided that you also do one of the following: + + a) Accompany it with the complete corresponding machine-readable + source code, which must be distributed under the terms of Sections + 1 and 2 above on a medium customarily used for software interchange; or, + + b) Accompany it with a written offer, valid for at least three + years, to give any third party, for a charge no more than your + cost of physically performing source distribution, a complete + machine-readable copy of the corresponding source code, to be + distributed under the terms of Sections 1 and 2 above on a medium + customarily used for software interchange; or, + + c) Accompany it with the information you received as to the offer + to distribute corresponding source code. (This alternative is + allowed only for noncommercial distribution and only if you + received the program in object code or executable form with such + an offer, in accord with Subsection b above.) + +The source code for a work means the preferred form of the work for +making modifications to it. For an executable work, complete source +code means all the source code for all modules it contains, plus any +associated interface definition files, plus the scripts used to +control compilation and installation of the executable. However, as a +special exception, the source code distributed need not include +anything that is normally distributed (in either source or binary +form) with the major components (compiler, kernel, and so on) of the +operating system on which the executable runs, unless that component +itself accompanies the executable. 
+ +If distribution of executable or object code is made by offering +access to copy from a designated place, then offering equivalent +access to copy the source code from the same place counts as +distribution of the source code, even though third parties are not +compelled to copy the source along with the object code. + + 4. You may not copy, modify, sublicense, or distribute the Program +except as expressly provided under this License. Any attempt +otherwise to copy, modify, sublicense or distribute the Program is +void, and will automatically terminate your rights under this License. +However, parties who have received copies, or rights, from you under +this License will not have their licenses terminated so long as such +parties remain in full compliance. + + 5. You are not required to accept this License, since you have not +signed it. However, nothing else grants you permission to modify or +distribute the Program or its derivative works. These actions are +prohibited by law if you do not accept this License. Therefore, by +modifying or distributing the Program (or any work based on the +Program), you indicate your acceptance of this License to do so, and +all its terms and conditions for copying, distributing or modifying +the Program or works based on it. + + 6. Each time you redistribute the Program (or any work based on the +Program), the recipient automatically receives a license from the +original licensor to copy, distribute or modify the Program subject to +these terms and conditions. You may not impose any further +restrictions on the recipients' exercise of the rights granted herein. +You are not responsible for enforcing compliance by third parties to +this License. + + 7. 
If, as a consequence of a court judgment or allegation of patent +infringement or for any other reason (not limited to patent issues), +conditions are imposed on you (whether by court order, agreement or +otherwise) that contradict the conditions of this License, they do not +excuse you from the conditions of this License. If you cannot +distribute so as to satisfy simultaneously your obligations under this +License and any other pertinent obligations, then as a consequence you +may not distribute the Program at all. For example, if a patent +license would not permit royalty-free redistribution of the Program by +all those who receive copies directly or indirectly through you, then +the only way you could satisfy both it and this License would be to +refrain entirely from distribution of the Program. + +If any portion of this section is held invalid or unenforceable under +any particular circumstance, the balance of the section is intended to +apply and the section as a whole is intended to apply in other +circumstances. + +It is not the purpose of this section to induce you to infringe any +patents or other property right claims or to contest validity of any +such claims; this section has the sole purpose of protecting the +integrity of the free software distribution system, which is +implemented by public license practices. Many people have made +generous contributions to the wide range of software distributed +through that system in reliance on consistent application of that +system; it is up to the author/donor to decide if he or she is willing +to distribute software through any other system and a licensee cannot +impose that choice. + +This section is intended to make thoroughly clear what is believed to +be a consequence of the rest of this License. + + 8. 
If the distribution and/or use of the Program is restricted in +certain countries either by patents or by copyrighted interfaces, the +original copyright holder who places the Program under this License +may add an explicit geographical distribution limitation excluding +those countries, so that distribution is permitted only in or among +countries not thus excluded. In such case, this License incorporates +the limitation as if written in the body of this License. + + 9. The Free Software Foundation may publish revised and/or new versions +of the General Public License from time to time. Such new versions will +be similar in spirit to the present version, but may differ in detail to +address new problems or concerns. + +Each version is given a distinguishing version number. If the Program +specifies a version number of this License which applies to it and "any +later version", you have the option of following the terms and conditions +either of that version or of any later version published by the Free +Software Foundation. If the Program does not specify a version number of +this License, you may choose any version ever published by the Free Software +Foundation. + + 10. If you wish to incorporate parts of the Program into other free +programs whose distribution conditions are different, write to the author +to ask for permission. For software which is copyrighted by the Free +Software Foundation, write to the Free Software Foundation; we sometimes +make exceptions for this. Our decision will be guided by the two goals +of preserving the free status of all derivatives of our free software and +of promoting the sharing and reuse of software generally. + + NO WARRANTY + + 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY +FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. 
EXCEPT WHEN +OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES +PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED +OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF +MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS +TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE +PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, +REPAIR OR CORRECTION. + + 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING +WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR +REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, +INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING +OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED +TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY +YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER +PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE +POSSIBILITY OF SUCH DAMAGES. + + END OF TERMS AND CONDITIONS + + How to Apply These Terms to Your New Programs + + If you develop a new program, and you want it to be of the greatest +possible use to the public, the best way to achieve this is to make it +free software which everyone can redistribute and change under these terms. + + To do so, attach the following notices to the program. It is safest +to attach them to the start of each source file to most effectively +convey the exclusion of warranty; and each file should have at least +the "copyright" line and a pointer to where the full notice is found. + + + Copyright (C) + + This program is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. 
+ + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program; if not, write to the Free Software + Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + + +Also add information on how to contact you by electronic and paper mail. + +If the program is interactive, make it output a short notice like this +when it starts in an interactive mode: + + Gnomovision version 69, Copyright (C) year name of author + Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'. + This is free software, and you are welcome to redistribute it + under certain conditions; type `show c' for details. + +The hypothetical commands `show w' and `show c' should show the appropriate +parts of the General Public License. Of course, the commands you use may +be called something other than `show w' and `show c'; they could even be +mouse-clicks or menu items--whatever suits your program. + +You should also get your employer (if you work as a programmer) or your +school, if any, to sign a "copyright disclaimer" for the program, if +necessary. Here is a sample; alter the names: + + Yoyodyne, Inc., hereby disclaims all copyright interest in the program + `Gnomovision' (which makes passes at compilers) written by James Hacker. + + , 1 April 1989 + Ty Coon, President of Vice + +This General Public License does not permit incorporating your program into +proprietary programs. If your program is a subroutine library, you may +consider it more useful to permit linking proprietary applications with the +library. If this is what you want to do, use the GNU Library General +Public License instead of this License. 
diff --git a/README b/README new file mode 100644 index 0000000..296798b --- /dev/null +++ b/README @@ -0,0 +1,16 @@ +DOCUMENTATION for this module is at + http://apache.webthing.com/mod_proxy_html/ + +UPGRADING: IMPORTANT NOTE + +If you are upgrading from mod_proxy_html 2.x (or 1.x), you will need +some new configuration. You can Include proxy_html.conf from this +bundle in your httpd.conf (or apache.conf) to use Version 3 as a +drop-in replacement for Version 2. + +WINDOWS USERS: + +You may need to install some prerequisite libraries before you can +load mod_proxy_html into apache. If you don't already have them, +see the README at + http://apache.webthing.com/mod_proxy_html/windows/ diff --git a/config.html b/config.html index 7b062dc..f2fcacd 100644 --- a/config.html +++ b/config.html @@ -8,7 +8,8 @@

mod_proxy_html: Configuration

-

mod_proxy_html Version 2.4 (Sept 2004) and upwards

+

mod_proxy_html Version 2.4 (Sept 2004) and upwards. +Updates in Version 3 (Dec. 2006) are highlighted.

Configuration Directives

The following can be used anywhere in an httpd.conf or included configuration file.

@@ -16,7 +17,7 @@ or included configuration file.

ProxyHTMLURLMap

Syntax: -ProxyHTMLURLMap from-pattern to-pattern flags

+ProxyHTMLURLMap from-pattern to-pattern [flags] [cond]

This is the key directive for rewriting HTML links. When parsing a document, whenever a link target matches from-pattern, the matching portion will be rewritten to to-pattern.

@@ -24,6 +25,13 @@ portion will be rewritten to to-pattern.

and substitutions, including regular expression search and replace, controlled by the optional third flags argument.

+

Starting at version 3.0, this also supports environment variable +interpolation using the V and v flags, and rules may apply conditionally +based on an environment variable. Note that interpolation takes place +before the parse starts, so variables set during the parse (e.g. +using SSI directives) will not apply. This flexible configuration +is enabled by the ProxyHTMLInterp directive, or can +be disabled for speed.

Flags for ProxyHTMLURLMap

Flags are case-sensitive.

@@ -36,6 +44,9 @@ controlled by the optional third flags argument.
L

Last-match. If this rule matches, no more rules are applied (note that this happens automatically for HTML links).

+
l
+
Opposite to L. Overrides the one-change-only default +behaviour with HTML links.
R

Use Regular Expression matching-and-replace. from-pattern is a regexp, and to-pattern a replacement string that may be @@ -61,8 +72,35 @@ versions 1.x. Logic is starts-with in HTML links, but

$

Match at end only. This applies only to string matching (not regexps) and is irrelevant to HTML links.

+
V
+

Interpolate environment variables in to-pattern. +A string of the form ${varname|default} will be replaced by the +value of environment variable varname. If that is unset, it +is replaced by default. The |default is optional.

+

NOTE: interpolation will only be enabled if ProxyHTMLInterp is On.

+
+
v
+

Interpolate environment variables in from-pattern. +Patterns supported are as above.

+

NOTE: interpolation will only be enabled if ProxyHTMLInterp is On.

+
- +

Conditions for ProxyHTMLURLMap

+

The optional cond argument specifies a condition to +test before the parse. If a condition is unsatisfied, the URLMap +will be ignored in this parse.

+

The condition takes the form [!]var[=val], and is +satisfied if the value of environment variable var +is val. If the optional =val is omitted, +then any value of var satisfies the condition, provided +only it is set to something. If the first character is !, +the condition is reversed.

+

NOTE: conditions will only be applied if ProxyHTMLInterp is On.

+
+
ProxyHTMLInterp
+
+

Syntax: ProxyHTMLInterp On|Off

+

Enables new (per-request) features of ProxyHTMLURLMap.
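
As a rough illustration of how the V flag and the cond argument described above fit together, the fragment below is a sketch only. The host names, paths and variable names (FRONTEND_BASE, mobile_client) are made up for the example, and it assumes SetEnvIf from mod_setenvif is available to set the variable before the parse starts.

```apache
# Per-request features (interpolation and conditions) stay off unless enabled.
ProxyHTMLInterp On

# Hypothetical variable, set early in the request so it exists before parsing.
SetEnvIf Host ^m\. mobile_client=1

# V: interpolate ${FRONTEND_BASE|/app/} in to-pattern; /app/ is used if the variable is unset.
ProxyHTMLURLMap http://backend.example.com/ ${FRONTEND_BASE|/app/} V

# Used only when mobile_client has the value 1 for this request.
ProxyHTMLURLMap http://static.example.com/ /m-static/ L mobile_client=1

# Used only when mobile_client is not set at all.
ProxyHTMLURLMap http://static.example.com/ /static/ L !mobile_client
```

The two conditional rules are written to be mutually exclusive, so at most one of them is active in any given parse.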

ProxyHTMLDoctype
@@ -82,6 +120,10 @@ argument determines whether SGML/HTML or XML/XHTML syntax will be used.

Starting at version 2.0, the default is changed to omitting any FPI, on the grounds that no FPI is better than a bogus one. If your backend generates decent HTML or XHTML, set it accordingly.

+

From version 3, if the first form is used, mod_proxy_html +will also clean up the HTML to the specified standard. It cannot +fix every error, but it will strip out bogus elements and attributes. +It will also optionally log other errors at LogLevel Debug.

ProxyHTMLFixups
@@ -100,6 +142,9 @@ Only use them if you know you have a broken backend server.

Syntax ProxyHTMLMeta [On|Off]

Parses <meta http-equiv ...> elements to real HTTP headers.

+

In version 3, this is also tied in with the improved +internationalisation support, and is +required to support some character encodings.

ProxyHTMLExtended

Syntax ProxyHTMLExtended [On|Off]

@@ -137,10 +182,73 @@ be expanded as necessary to hold the largest script or stylesheet in a page, in increments of [nnnn] as set by this directive.

The default is 8192, and will work well for almost all pages. However, if you know you're proxying a lot of pages containing stylesheets and/or -scripts bigger than 8K, it will be more efficient to set a larger buffer +scripts bigger than 8K (that is, for a single script or stylesheet, +NOT in total), it will be more efficient to set a larger buffer size and avoid the need to resize the buffer dynamically during a request.

+
ProxyHTMLEvents
+
+

Syntax ProxyHTMLEvents attr [attr ...]

+

Specifies one or more attributes to treat as scripting events and +apply URLMaps to where appropriate. You can specify any number of +attributes in one or more ProxyHTMLEvents directives. +The sample configuration +defines the events in standard HTML 4 and XHTML 1.
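
A minimal sketch of what such a declaration might look like. Only onclick is named in this documentation; the other attributes are standard HTML 4 event attributes chosen for illustration, and this is not the full list from the sample configuration.

```apache
# Treat these attributes as scripting events; the list may be split over several directives.
ProxyHTMLEvents onclick ondblclick onload onunload
ProxyHTMLEvents onsubmit onchange
```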

+
+
ProxyHTMLLinks
+
+

Syntax ProxyHTMLLinks elt attr [attr ...]

+

Specifies elements that have URL attributes that should be rewritten +using standard URLMaps as in versions 1 and 2 of mod_proxy_html. +You will need one ProxyHTMLLinks directive per element, +but it can have any number of attributes. The sample configuration +defines the HTML links for standard HTML 4 and XHTML 1.
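
For example, a few link declarations could be written as follows. This is an illustrative subset using HTML 4 attributes of type %URI, not the complete sample configuration.

```apache
# One directive per element, followed by every URL-valued attribute of that element.
ProxyHTMLLinks a    href
ProxyHTMLLinks img  src longdesc usemap
ProxyHTMLLinks form action
```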

+
+
ProxyHTMLCharsetAlias
+
+

Syntax ProxyHTMLCharsetAlias charset alias [alias ...]

+

This server-wide directive aliases one or more charsets to another +charset. This enables encodings not recognised by libxml2 to be handled +internally by libxml2's charset support using the translation table for +a recognised charset.

+

For example, Latin 1 (ISO-8859-1) is supported by libxml2. +Microsoft's Windows-1252 is almost identical and can be supported +by aliasing it:
+ProxyHTMLCharsetAlias ISO-8859-1 Windows-1252

+
+
ProxyHTMLCharsetDefault
+
+

Syntax ProxyHTMLCharsetDefault name

+

This defines the default encoding to assume when absolutely no charset +information is available from the backend server. The default value for +this is ISO-8859-1, as specified in HTTP/1.0 and assumed in +earlier mod_proxy_html versions.

+
+
ProxyHTMLCharsetOut
+
+

Syntax ProxyHTMLCharsetOut name

+

This selects an encoding for mod_proxy_html output. It should not +normally be used, as any change from the default UTF-8 +(Unicode - as used internally by libxml2) will impose an additional +processing overhead. The special token ProxyHTMLCharsetOut * +will generate output using the same encoding as the input.
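
A short sketch of the two charset settings side by side. Both lines simply restate values mentioned in this documentation: ISO-8859-1 is already the default for the first directive, and * is the special token for the second.

```apache
# Assume Latin 1 when the backend supplies no charset information at all (the documented default).
ProxyHTMLCharsetDefault ISO-8859-1

# Emit output in the same encoding as the input instead of UTF-8, at the cost of extra processing.
ProxyHTMLCharsetOut *
```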

+
+
ProxyHTMLStartParse
+
+

Syntax ProxyHTMLStartParse element [elt*]

+

Specify that the HTML parser should start at the first instance +of any of the elements specified. This can be used where a broken +backend inserts leading junk that messes up the parser (example here).
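
As an illustration, assuming a backend that emits stray bytes before the document proper, the directive might be used like this. The element names are chosen for the example only.

```apache
# Start parsing at the first <html> or <head> element, skipping whatever precedes it.
ProxyHTMLStartParse html head
```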

+
+

Other Configuration

+

Normally, mod_proxy_html will refuse to run when not +in a proxy or when the contents are not HTML. This can be overridden +(at your own risk) by setting the environment variable +PROXY_HTML_FORCE (e.g. with the SetEnv directive).
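
A minimal sketch of the override. The value 1 is an assumption, since the text above only says that the variable has to be set.

```apache
# Run the filter even where mod_proxy_html would normally decline (at your own risk).
SetEnv PROXY_HTML_FORCE 1
```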

diff --git a/debian/changelog b/debian/changelog index ecb688b..f976198 100644 --- a/debian/changelog +++ b/debian/changelog @@ -1,3 +1,9 @@ +mod-proxy-html (3.0.0-1) unstable; urgency=low + + * New upstream release, closes: #446782 + + -- Emmanuel Lacour Tue, 16 Oct 2007 11:38:11 +0200 + mod-proxy-html (2.5.2-1.1) unstable; urgency=high * Non-maintainer upload. diff --git a/debian/copyright b/debian/copyright index 8edec20..c965cb8 100644 --- a/debian/copyright +++ b/debian/copyright @@ -5,7 +5,7 @@ It was downloaded from http://apache.webthing.com/mod_proxy_html/ Upstream Author: Nick Kew -Copyright (c) 2003-4, WebThing Ltd +Copyright (c) 2003-7, WebThing Ltd License: diff --git a/debian/install b/debian/install index 1ef61ee..e4b0c07 100644 --- a/debian/install +++ b/debian/install @@ -1,2 +1,3 @@ debian/conf/proxy_html.load /etc/apache2/mods-available/ +proxy_html.conf /etc/apache2/mods-available/ .libs/mod_proxy_html.so /usr/lib/apache2/modules/ diff --git a/guide.html b/guide.html index fac13e2..017c4eb 100644 --- a/guide.html +++ b/guide.html @@ -5,10 +5,14 @@ +

mod_proxy_html: Technical Guide

-

mod_proxy_html From Version 2.4 (Sept 2004).

+

mod_proxy_html From Version 2.4 (Sept 2004). +Updates in Version 3 (Dec. 2006) are highlighted.

Contents

  • URL Rewriting @@ -63,15 +67,23 @@ and scripting events, where it is clearly irrelevant.
  • DTDs as of type %URI. For example, the href attribute of the a element. For a full list, see the declaration of linked_elts in pstartElement. -Rules are applicable provided the h flag is not set.

    +Rules are applicable provided the h flag is not set. +From Version 3, the definition of links to use is +delegated to the system administrator via the ProxyHTMLLinks +directive.

    An HTML link always contains exactly one URL. So whenever mod_proxy_html finds a matching ProxyHTMLURLMap rule, it will apply the -transformation once and stop processing the attribute.

    +transformation once and stop processing the attribute. This +can be overridden by the l flag, which causes processing +a URL to continue after a rewrite.

    Scripting Events

    Scripting events are the contents of event attributes as defined in the HTML4 and XHTML1 DTDs; for example onclick. For a full list, see the declaration of events in pstartElement. -Rules are applicable provided the e flag is not set.

    +Rules are applicable provided the e flag is not set. +From Version 3, the definition of events to use is +delegated to the system administrator via the ProxyHTMLEvents +directive.

    A scripting event may contain more than one URL, and will contain other text. So when ProxyHTMLExtended is On, all applicable rules will be applied in order until and unless a rule with the L flag @@ -116,11 +128,18 @@ apply the appropriate rules in generating output. HTML saves a few bytes.

    If you declare a custom DTD, you should specify whether to generate HTML or XHTML syntax in the output. This affects empty elements: HTML <br> vs XHTML <br />.

    +

    If you select standard HTML or XHTML, mod_proxy_html 3 will +perform some additional fixups of bogus markup. If you don't want this, +you can enter a standard DTD using the nonstandard form of +ProxyHTMLDoctype, which will then be treated as unknown +(no corrections).

    Character Encoding

    The parser uses UTF-8 (Unicode) internally, and -mod_proxy_html always generates output as UTF-8. This is -supported by all general-purpose web software, and supports more -character sets and languages than any other charset.

    +mod_proxy_html prior to version 3 always generates output as UTF-8. +This is supported by all general-purpose web software, and supports more +character sets and languages than any other charset. +Version 3 supports, but does not recommend, different outputs, using +the ProxyHTMLCharsetOut directive.

    The character encoding should be declared in HTTP: for example
    Content-Type: text/html; charset=latin1
    mod_proxy_html has always supported this in its input, and ensured @@ -139,11 +158,28 @@ information. <meta http-equiv="Content-Type" ...>, any charset declared here is used.

  • In the absence of any of the above indications, the HTML-over-HTTP default -encoding ISO-8859-1 is assumed.
  • +encoding ISO-8859-1 or the +ProxyHTMLCharsetDefault value is assumed.
  • The parser is set to ignore invalid characters, so a malformed input stream will generate glitches (unexpected characters) rather than risk aborting a parse altogether.
  • +

    In version 3.0, this remains the default, but +internationalisation support is further improved, and is no longer +limited to the encodings supported by libxml2:

    +
      +
    • The ProxyHTMLCharsetAlias directive enables server +administrators to support additional encodings by aliasing them to +something supported by libxml2.
    • +
    • When a charset that is neither directly supported nor aliased is +encountered, mod_proxy_html 3 will attempt to support it using Apache/APR's +charset conversion support in apr_xlate, which on most platforms +is a wrapper for the leading conversion utility iconv. +Because of undocumented behaviour of libxml2, this may cause problems +when charset is specified in an HTML META element. This +feature is therefore only enabled when ProxyHTMLMeta is On.
    • +
    +

    meta http-equiv support

    The HTML meta element includes a form <meta http-equiv="Some-Header" contents="some-value"> diff --git a/mod_proxy_html.c b/mod_proxy_html.c index 26097c1..6a4e59b 100644 --- a/mod_proxy_html.c +++ b/mod_proxy_html.c @@ -1,51 +1,52 @@ /******************************************************************** - Copyright (c) 2003-5, WebThing Ltd - Author: Nick Kew + Copyright (c) 2003-7, WebThing Ltd + Author: Nick Kew This program is free software; you can redistribute it and/or modify -it under the terms of the GNU General Public License as published by -the Free Software Foundation; either version 2 of the License, or -(at your option) any later version. - +it under the terms of the GNU General Public License Version 2, +as published by the Free Software Foundation. + This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. -You should have received a copy of the GNU General Public License -along with this program; if not, write to the Free Software -Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. - +You can obtain a copy of the GNU General Poblic License Version 2 +from http://www.gnu.org/licenses/old-licenses/gpl-2.0.html or +http://apache.webthing.com/COPYING.txt + *********************************************************************/ /******************************************************************** - Note to Users + Note to Users - You are requested to register as a user, at - http://apache.webthing.com/registration.html + You are requested to register as a user, at + http://apache.webthing.com/registration.html - This entitles you to support from the developer. 
- I'm unlikely to reply to help/support requests from - non-registered users, unless you're paying and/or offering - constructive feedback such as bug reports or sensible - suggestions for further development. + This entitles you to support from the developer. + I'm unlikely to reply to help/support requests from + non-registered users, unless you're paying and/or offering + constructive feedback such as bug reports or sensible + suggestions for further development. - It also makes a small contribution to the effort - that's gone into developing this work. + It also makes a small contribution to the effort + that's gone into developing this work. *********************************************************************/ /* End of Notices */ -/* GO_FASTER - You can #define GO_FASTER to disable informational logging. - This disables the ProxyHTMLLogVerbose option altogether. - Default is to leave it undefined, and enable verbose logging - as a configuration option. Binaries are supplied with verbose - logging enabled. +/* GO_FASTER + + You can #define GO_FASTER to disable informational logging. + This disables the ProxyHTMLLogVerbose option altogether. + + Default is to leave it undefined, and enable verbose logging + as a configuration option. Binaries are supplied with verbose + logging enabled. */ #ifdef GO_FASTER @@ -54,11 +55,11 @@ Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. #define VERBOSE(x) if ( verbose ) x #endif -#define VERSION_STRING "proxy_html/2.5" +#define VERSION_STRING "proxy_html/3.0.0" #include -/* libxml */ +/* libxml2 */ #include /* apache */ @@ -66,6 +67,8 @@ Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. #include #include #include +#include +#include /* To support Apache 2.1/2.2, we need the ap_ forms of the * regexp stuff, and they're now used in the code. @@ -80,46 +83,73 @@ Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. 
#define AP_REG_ICASE REG_ICASE #define AP_REG_NOSUB REG_NOSUB #define AP_REG_NEWLINE REG_NEWLINE +#define APACHE20 +#define ap_register_output_filter_protocol(a,b,c,d,e) ap_register_output_filter(a,b,c,d) +#else +#define APACHE22 #endif module AP_MODULE_DECLARE_DATA proxy_html_module ; -#define M_HTML 0x01 -#define M_EVENTS 0x02 -#define M_CDATA 0x04 -#define M_REGEX 0x08 -#define M_ATSTART 0x10 -#define M_ATEND 0x20 -#define M_LAST 0x40 +#define M_HTML 0x01 +#define M_EVENTS 0x02 +#define M_CDATA 0x04 +#define M_REGEX 0x08 +#define M_ATSTART 0x10 +#define M_ATEND 0x20 +#define M_LAST 0x40 +#define M_NOTLAST 0x80 +#define M_INTERPOLATE_TO 0x100 +#define M_INTERPOLATE_FROM 0x200 typedef struct { + const char* val; +} tattr; +typedef struct { unsigned int start ; unsigned int end ; } meta ; +typedef struct { + const char* env; + const char* val; + int rel; +} rewritecond; typedef struct urlmap { struct urlmap* next ; unsigned int flags ; + unsigned int regflags ; union { const char* c ; ap_regex_t* r ; } from ; const char* to ; + rewritecond* cond; } urlmap ; typedef struct { urlmap* map ; const char* doctype ; const char* etag ; unsigned int flags ; + size_t bufsz ; + apr_hash_t* links; + apr_array_header_t* events; + apr_array_header_t* skipto; + xmlCharEncoding default_encoding; + const char* charset_out; int extfix ; int metafix ; int strip_comments ; + int interp; #ifndef GO_FASTER int verbose ; #endif - size_t bufsz ; } proxy_html_conf ; typedef struct { - htmlSAXHandlerPtr sax ; + apr_xlate_t* convset; + char* buf; + apr_size_t bytes; +} conv_t; +typedef struct { ap_filter_t* f ; proxy_html_conf* cfg ; htmlParserCtxtPtr parser ; @@ -127,58 +157,169 @@ typedef struct { char* buf ; size_t offset ; size_t avail ; + conv_t* conv_in; + conv_t* conv_out; + const char* encoding; + urlmap* map; } saxctxt ; -static int is_empty_elt(const char* name) { - const char** p ; - static const char* empty_elts[] = { - "br" , - "link" , - "img" , - "hr" , - "input" , - "meta" , 
- "base" , - "area" , - "param" , - "col" , - "frame" , - "isindex" , - "basefont" , - NULL - } ; - for ( p = empty_elts ; *p ; ++p ) - if ( !strcmp( *p, name) ) - return 1 ; - return 0 ; -} - -typedef struct { - const char* name ; - const char** attrs ; -} elt_t ; #define NORM_LC 0x1 #define NORM_MSSLASH 0x2 #define NORM_RESET 0x4 +static htmlSAXHandler sax ; typedef enum { ATTR_IGNORE, ATTR_URI, ATTR_EVENT } rewrite_t ; +static const char* const fpi_html = + "\n" ; +static const char* const fpi_html_legacy = + "\n" ; +static const char* const fpi_xhtml = + "\n" ; +static const char* const fpi_xhtml_legacy = + "\n" ; +static const char* const html_etag = ">" ; +static const char* const xhtml_etag = " />" ; +/*#define DEFAULT_DOCTYPE fpi_html */ +static const char* const DEFAULT_DOCTYPE = "" ; +#define DEFAULT_ETAG html_etag + static void normalise(unsigned int flags, char* str) { - xmlChar* p ; + char* p ; if ( flags & NORM_LC ) for ( p = str ; *p ; ++p ) if ( isupper(*p) ) - *p = tolower(*p) ; + *p = tolower(*p) ; if ( flags & NORM_MSSLASH ) - for ( p = strchr(str, '\\') ; p ; p = strchr(p+1, '\\') ) + for ( p = ap_strchr_c(str, '\\') ; p ; p = ap_strchr_c(p+1, '\\') ) *p = '/' ; } +static void consume_buffer(saxctxt* ctx, const char* inbuf, + int bytes, int flag) { + apr_status_t rv; + apr_size_t insz; + char* buf; +#ifndef GO_FASTER + int verbose = ctx->cfg->verbose; +#endif + if (ctx->conv_in == NULL) { + /* just feed it to libxml2 */ + htmlParseChunk(ctx->parser, inbuf, bytes, flag) ; + return; + } + if (ctx->conv_in->bytes > 0) { + /* FIXME: make this a reusable buf? 
*/ + buf = apr_palloc(ctx->f->r->pool, ctx->conv_in->bytes + bytes); + memcpy(buf, ctx->conv_in->buf, ctx->conv_in->bytes); + memcpy(buf + ctx->conv_in->bytes, inbuf, bytes); + bytes += ctx->conv_in->bytes; + ctx->conv_in->bytes = 0; + } else { + buf = (char*) inbuf; + } + insz = bytes; + while (insz > 0) { + char outbuf[4096]; + apr_size_t outsz = 4096; + rv = apr_xlate_conv_buffer(ctx->conv_in->convset, + buf + (bytes - insz), &insz, + outbuf, &outsz); + htmlParseChunk(ctx->parser, outbuf, 4096-outsz, flag) ; + switch (rv) { + case APR_SUCCESS: + continue; + case APR_EINCOMPLETE: /* save dangling byte(s) and return */ + ctx->conv_in->bytes = insz; + ctx->conv_in->buf = (buf != inbuf) ? buf + (bytes-insz) + : apr_pmemdup(ctx->f->r->pool, buf + (bytes-insz), insz); + break; + case APR_EINVAL: /* try skipping one bad byte */ + VERBOSE( ap_log_rerror(APLOG_MARK, APLOG_ERR, 0, ctx->f->r, + "Skipping invalid byte in input stream!") ) ; + --insz; + continue; + default: + /* Erk! What's this? Bail out and eat the buf raw + * if libxml2 will accept it! + */ + ap_log_rerror(APLOG_MARK, APLOG_ERR, rv, ctx->f->r, + "Failed to convert input; trying it raw") ; + htmlParseChunk(ctx->parser, buf + (bytes - insz), insz, flag) ; + ctx->conv_in = NULL; /* don't try converting any more */ + break; + } + } +} +static void AP_fwrite(saxctxt* ctx, const char* inbuf, int bytes, int flush) { + /* convert charset if necessary, and output */ + char* buf; + apr_status_t rv; + apr_size_t insz ; +#ifndef GO_FASTER + int verbose = ctx->cfg->verbose; +#endif -#define FLUSH ap_fwrite(ctx->f->next, ctx->bb, (chars+begin), (i-begin)) ; begin = i+1 -static void pcharacters(void* ctxt, const xmlChar *chars, int length) { + if (ctx->conv_out == NULL) { + ap_fwrite(ctx->f->next, ctx->bb, inbuf, bytes); + return; + } + if (ctx->conv_out->bytes > 0) { + /* FIXME: make this a reusable buf? 
*/ + buf = apr_palloc(ctx->f->r->pool, ctx->conv_out->bytes + bytes); + memcpy(buf, ctx->conv_out->buf, ctx->conv_out->bytes); + memcpy(buf + ctx->conv_out->bytes, inbuf, bytes); + bytes += ctx->conv_out->bytes; + ctx->conv_out->bytes = 0; + } else { + buf = (char*) inbuf; + } + insz = bytes; + while (insz > 0) { + char outbuf[2048]; + apr_size_t outsz = 2048; + rv = apr_xlate_conv_buffer(ctx->conv_out->convset, + buf + (bytes - insz), &insz, + outbuf, &outsz); + ap_fwrite(ctx->f->next, ctx->bb, outbuf, 2048-outsz) ; + switch (rv) { + case APR_SUCCESS: + continue; + case APR_EINCOMPLETE: /* save dangling byte(s) and return */ + /* but if we need to flush, just abandon them */ + if ( flush) { /* if we're flushing, this must be complete */ + /* so this is an error */ + VERBOSE( ap_log_rerror(APLOG_MARK, APLOG_ERR, 0, ctx->f->r, + "Skipping invalid byte in output stream!") ) ; + } else { + ctx->conv_out->bytes = insz; + ctx->conv_out->buf = (buf != inbuf) ? buf + (bytes-insz) + : apr_pmemdup(ctx->f->r->pool, buf + (bytes-insz), insz); + } + break; + case APR_EINVAL: /* try skipping one bad byte */ + VERBOSE( ap_log_rerror(APLOG_MARK, APLOG_ERR, 0, ctx->f->r, + "Skipping invalid byte in output stream!") ) ; + --insz; + continue; + default: + /* Erk! What's this? Bail out and pass the buf raw + * if libxml2 will accept it! + */ + VERBOSE( ap_log_rerror(APLOG_MARK, APLOG_ERR, rv, ctx->f->r, + "Failed to convert output; sending UTF-8") ) ; + ap_fwrite(ctx->f->next, ctx->bb, buf + (bytes - insz), insz) ; + break; + } + } +} + +/* This is always utf-8 on entry. 
We can convert charset within FLUSH */ +#define FLUSH AP_fwrite(ctx, (chars+begin), (i-begin), 0) ; begin = i+1 +static void pcharacters(void* ctxt, const xmlChar *uchars, int length) { + const char* chars = (const char*) uchars; saxctxt* ctx = (saxctxt*) ctxt ; int i ; int begin ; @@ -203,9 +344,9 @@ static void preserve(saxctxt* ctx, const size_t len) { newbuf = realloc(ctx->buf, ctx->avail) ; if ( newbuf != ctx->buf ) { if ( ctx->buf ) - apr_pool_cleanup_kill(ctx->f->r->pool, ctx->buf, (void*)free) ; + apr_pool_cleanup_kill(ctx->f->r->pool, ctx->buf, (void*)free) ; apr_pool_cleanup_register(ctx->f->r->pool, newbuf, - (void*)free, apr_pool_cleanup_null); + (void*)free, apr_pool_cleanup_null); ctx->buf = newbuf ; } } @@ -224,81 +365,87 @@ static void dump_content(saxctxt* ctx) { ap_regmatch_t pmatch[10] ; char* subs ; size_t len, offs ; + urlmap* themap = ctx->map; #ifndef GO_FASTER int verbose = ctx->cfg->verbose ; #endif - pappend(ctx, &c, 1) ; /* append null byte */ - /* parse the text for URLs */ - for ( m = ctx->cfg->map ; m ; m = m->next ) { + pappend(ctx, &c, 1) ; /* append null byte */ + /* parse the text for URLs */ + for ( m = themap ; m ; m = m->next ) { if ( ! ( m->flags & M_CDATA ) ) - continue ; + continue ; if ( m->flags & M_REGEX ) { nmatch = 10 ; offs = 0 ; while ( ! 
ap_regexec(m->from.r, ctx->buf+offs, nmatch, pmatch, 0) ) { - match = pmatch[0].rm_so ; - s_from = pmatch[0].rm_eo - match ; - subs = ap_pregsub(ctx->f->r->pool, m->to, ctx->buf+offs, - nmatch, pmatch) ; - s_to = strlen(subs) ; - len = strlen(ctx->buf) ; - offs += match ; - VERBOSE( { - const char* f = apr_pstrndup(ctx->f->r->pool, - ctx->buf + offs , s_from ) ; - ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, ctx->f->r, - "C/RX: match at %s, substituting %s", f, subs) ; - } ) - if ( s_to > s_from) { - preserve(ctx, s_to - s_from) ; - memmove(ctx->buf+offs+s_to, ctx->buf+offs+s_from, - len + 1 - s_from - offs) ; - memcpy(ctx->buf+offs, subs, s_to) ; - } else { - memcpy(ctx->buf + offs, subs, s_to) ; - memmove(ctx->buf+offs+s_to, ctx->buf+offs+s_from, - len + 1 - s_from - offs) ; - } - offs += s_to ; + match = pmatch[0].rm_so ; + s_from = pmatch[0].rm_eo - match ; + subs = ap_pregsub(ctx->f->r->pool, m->to, ctx->buf+offs, + nmatch, pmatch) ; + s_to = strlen(subs) ; + len = strlen(ctx->buf) ; + offs += match ; + VERBOSE( { + const char* f = apr_pstrndup(ctx->f->r->pool, + ctx->buf + offs , s_from ) ; + ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, ctx->f->r, + "C/RX: match at %s, substituting %s", f, subs) ; + } ) + if ( s_to > s_from) { + preserve(ctx, s_to - s_from) ; + memmove(ctx->buf+offs+s_to, ctx->buf+offs+s_from, + len + 1 - s_from - offs) ; + memcpy(ctx->buf+offs, subs, s_to) ; + } else { + memcpy(ctx->buf + offs, subs, s_to) ; + memmove(ctx->buf+offs+s_to, ctx->buf+offs+s_from, + len + 1 - s_from - offs) ; + } + offs += s_to ; } } else { s_from = strlen(m->from.c) ; s_to = strlen(m->to) ; for ( found = strstr(ctx->buf, m->from.c) ; found ; - found = strstr(ctx->buf+match+s_to, m->from.c) ) { - match = found - ctx->buf ; - if ( ( m->flags & M_ATSTART ) && ( match != 0) ) - break ; - len = strlen(ctx->buf) ; - if ( ( m->flags & M_ATEND ) && ( match < (len - s_from) ) ) - continue ; - VERBOSE( ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, ctx->f->r, - "C: matched %s, 
substituting %s", m->from.c, m->to) ) ; - if ( s_to > s_from ) { - preserve(ctx, s_to - s_from) ; - memmove(ctx->buf+match+s_to, ctx->buf+match+s_from, - len + 1 - s_from - match) ; - memcpy(ctx->buf+match, m->to, s_to) ; - } else { - memcpy(ctx->buf+match, m->to, s_to) ; - memmove(ctx->buf+match+s_to, ctx->buf+match+s_from, - len + 1 - s_from - match) ; - } + found = strstr(ctx->buf+match+s_to, m->from.c) ) { + match = found - ctx->buf ; + if ( ( m->flags & M_ATSTART ) && ( match != 0) ) + break ; + len = strlen(ctx->buf) ; + if ( ( m->flags & M_ATEND ) && ( match < (len - s_from) ) ) + continue ; + VERBOSE( ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, ctx->f->r, + "C: matched %s, substituting %s", m->from.c, m->to) ) ; + if ( s_to > s_from ) { + preserve(ctx, s_to - s_from) ; + memmove(ctx->buf+match+s_to, ctx->buf+match+s_from, + len + 1 - s_from - match) ; + memcpy(ctx->buf+match, m->to, s_to) ; + } else { + memcpy(ctx->buf+match, m->to, s_to) ; + memmove(ctx->buf+match+s_to, ctx->buf+match+s_from, + len + 1 - s_from - match) ; + } } } } - ap_fputs(ctx->f->next, ctx->bb, ctx->buf) ; + AP_fwrite(ctx, ctx->buf, strlen(ctx->buf), 1) ; } -static void pcdata(void* ctxt, const xmlChar *chars, int length) { +static void pcdata(void* ctxt, const xmlChar *uchars, int length) { + const char* chars = (const char*) uchars; saxctxt* ctx = (saxctxt*) ctxt ; if ( ctx->cfg->extfix ) { pappend(ctx, chars, length) ; } else { - ap_fwrite(ctx->f->next, ctx->bb, chars, length) ; + /* not sure if this should force-flush + * (i.e. can one cdata section come in multiple calls?) 
+ */ + AP_fwrite(ctx, chars, length, 0) ; } } -static void pcomment(void* ctxt, const xmlChar *chars) { +static void pcomment(void* ctxt, const xmlChar *uchars) { + const char* chars = (const char*) uchars; saxctxt* ctx = (saxctxt*) ctxt ; if ( ctx->cfg->strip_comments ) return ; @@ -308,29 +455,47 @@ static void pcomment(void* ctxt, const xmlChar *chars) { pappend(ctx, chars, strlen(chars) ) ; pappend(ctx, "-->", 3) ; } else { - ap_fputstrs(ctx->f->next, ctx->bb, "", NULL) ; + ap_fputs(ctx->f->next, ctx->bb, "") ; } } -static void pendElement(void* ctxt, const xmlChar* name) { +static void pendElement(void* ctxt, const xmlChar* uname) { saxctxt* ctx = (saxctxt*) ctxt ; + const char* name = (const char*) uname; + const htmlElemDesc* desc = htmlTagLookup(uname); + + if ((ctx->cfg->doctype == fpi_html) || (ctx->cfg->doctype == fpi_xhtml)) { + /* enforce html */ + if (!desc || desc->depr) + return; + + } else if ((ctx->cfg->doctype == fpi_html) + || (ctx->cfg->doctype == fpi_xhtml)) { + /* enforce html legacy */ + if (!desc) + return; + } + /* TODO - implement HTML "allowed here" using the stack */ + /* nah. Keeping the stack is too much overhead */ + if ( ctx->offset > 0 ) { dump_content(ctx) ; - ctx->offset = 0 ; /* having dumped it, we can re-use the memory */ + ctx->offset = 0 ; /* having dumped it, we can re-use the memory */ } - if ( ! is_empty_elt(name) ) + if ( !desc || ! 
desc->empty ) { ap_fprintf(ctx->f->next, ctx->bb, "</%s>", name) ; + } } -static void pstartElement(void* ctxt, const xmlChar* name, - const xmlChar** attrs ) { +static void pstartElement(void* ctxt, const xmlChar* uname, + const xmlChar** uattrs ) { + int required_attrs ; int num_match ; size_t offs, len ; char* subs ; rewrite_t is_uri ; - const char** linkattrs ; - const xmlChar** a ; - const elt_t* elt ; - const char** linkattr ; + const char** a ; urlmap* m ; size_t s_to, s_from, match ; char* found ; @@ -340,347 +505,359 @@ static void pstartElement(void* ctxt, const xmlChar* name, #ifndef GO_FASTER int verbose = ctx->cfg->verbose ; #endif - - static const char* href[] = { "href", NULL } ; - static const char* cite[] = { "cite", NULL } ; - static const char* action[] = { "action", NULL } ; - static const char* imgattr[] = { "src", "longdesc", "usemap", NULL } ; - static const char* inputattr[] = { "src", "usemap", NULL } ; - static const char* scriptattr[] = { "src", "for", NULL } ; - static const char* frameattr[] = { "src", "longdesc", NULL } ; - static const char* objattr[] = - { "classid", "codebase", "data", "usemap", NULL } ; - static const char* profile[] = { "profile", NULL } ; - static const char* background[] = { "background", NULL } ; - static const char* codebase[] = { "codebase", NULL } ; - - static const elt_t linked_elts[] = { - { "a" , href } , - { "img" , imgattr } , - { "form", action } , - { "link" , href } , - { "script" , scriptattr } , - { "base" , href } , - { "area" , href } , - { "input" , inputattr } , - { "frame", frameattr } , - { "iframe", frameattr } , - { "object", objattr } , - { "q" , cite } , - { "blockquote" , cite } , - { "ins" , cite } , - { "del" , cite } , - { "head" , profile } , - { "body" , background } , - { "applet", codebase } , - { NULL, NULL } - } ; - static const char* events[] = { - "onclick" , - "ondblclick" , - "onmousedown" , - "onmouseup" , - "onmouseover" , - "onmousemove" , - "onmouseout" , - "onkeypress" , - 
"onkeydown" , - "onkeyup" , - "onfocus" , - "onblur" , - "onload" , - "onunload" , - "onsubmit" , - "onreset" , - "onselect" , - "onchange" , - NULL - } ; + apr_array_header_t *linkattrs; + int i; + const char* name = (const char*) uname; + const char** attrs = (const char**) uattrs; + const htmlElemDesc* desc = htmlTagLookup(uname); + urlmap* themap = ctx->map; +#ifdef HAVE_STACK + const void** descp; +#endif + int enforce = 0; + if ((ctx->cfg->doctype == fpi_html) || (ctx->cfg->doctype == fpi_xhtml)) { + /* enforce html */ + enforce = 2; + if (!desc || desc->depr) + return; + + } else if ((ctx->cfg->doctype == fpi_html) + || (ctx->cfg->doctype == fpi_xhtml)) { + enforce = 1; + /* enforce html legacy */ + if (!desc) { + return; + } + } + if (!desc && enforce) { + ap_log_rerror(APLOG_MARK, APLOG_DEBUG, 0, ctx->f->r, + "Bogus HTML element %s dropped", name) ; + return; + } + if (desc && desc->depr && (enforce == 2) ) { + ap_log_rerror(APLOG_MARK, APLOG_DEBUG, 0, ctx->f->r, + "Deprecated HTML element %s dropped", name) ; + return; + } +#ifdef HAVE_STACK + descp = apr_array_push(ctx->stack); + *descp = desc; + /* TODO - implement HTML "allowed here" */ +#endif ap_fputc(ctx->f->next, ctx->bb, '<') ; ap_fputs(ctx->f->next, ctx->bb, name) ; + required_attrs = 0; + if ((enforce > 0) && (desc != NULL) && (desc->attrs_req != NULL)) + for (a = desc->attrs_req; *a; a++) + ++required_attrs; + if ( attrs ) { - linkattrs = 0 ; - for ( elt = linked_elts; elt->name != NULL ; ++elt ) - if ( !strcmp(elt->name, name) ) { - linkattrs = elt->attrs ; - break ; - } + linkattrs = apr_hash_get(ctx->cfg->links, name, APR_HASH_KEY_STRING) ; for ( a = attrs ; *a ; a += 2 ) { + if (desc && enforce > 0) { + switch (htmlAttrAllowed(desc, (xmlChar*)*a, 2-enforce)) { + case HTML_INVALID: + ap_log_rerror(APLOG_MARK, APLOG_DEBUG, 0, ctx->f->r, + "Bogus HTML attribute %s of %s dropped", *a, name); + continue; + case HTML_DEPRECATED: + ap_log_rerror(APLOG_MARK, APLOG_DEBUG, 0, ctx->f->r, + "Deprecated 
HTML attribute %s of %s dropped", *a, name); + continue; + case HTML_REQUIRED: + required_attrs--; /* cross off the number still needed */ + /* fallthrough - required implies valid */ + default: + break; + } + } ctx->offset = 0 ; if ( a[1] ) { - pappend(ctx, a[1], strlen(a[1])+1) ; - is_uri = ATTR_IGNORE ; - if ( linkattrs ) { - for ( linkattr = linkattrs ; *linkattr ; ++linkattr) { - if ( !strcmp(*linkattr, *a) ) { - is_uri = ATTR_URI ; - break ; - } - } - } - if ( (is_uri == ATTR_IGNORE) && ctx->cfg->extfix ) { - for ( linkattr = events; *linkattr; ++linkattr ) { - if ( !strcmp(*linkattr, *a) ) { - is_uri = ATTR_EVENT ; - break ; - } - } - } - switch ( is_uri ) { - case ATTR_URI: - num_match = 0 ; - for ( m = ctx->cfg->map ; m ; m = m->next ) { - if ( ! ( m->flags & M_HTML ) ) - continue ; - if ( m->flags & M_REGEX ) { - nmatch = 10 ; - if ( ! ap_regexec(m->from.r, ctx->buf, nmatch, pmatch, 0) ) { - ++num_match ; - offs = match = pmatch[0].rm_so ; - s_from = pmatch[0].rm_eo - match ; - subs = ap_pregsub(ctx->f->r->pool, m->to, ctx->buf+offs, - nmatch, pmatch) ; - VERBOSE( { - const char* f = apr_pstrndup(ctx->f->r->pool, - ctx->buf + offs , s_from ) ; - ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, ctx->f->r, - "H/RX: match at %s, substituting %s", f, subs) ; - } ) - s_to = strlen(subs) ; - len = strlen(ctx->buf) ; - if ( s_to > s_from) { - preserve(ctx, s_to - s_from) ; - memmove(ctx->buf+offs+s_to, ctx->buf+offs+s_from, - len + 1 - s_from - offs) ; - memcpy(ctx->buf+offs, subs, s_to) ; - } else { - memcpy(ctx->buf + offs, subs, s_to) ; - memmove(ctx->buf+offs+s_to, ctx->buf+offs+s_from, - len + 1 - s_from - offs) ; - } - } - } else { - s_from = strlen(m->from.c) ; - if ( ! 
strncasecmp(ctx->buf, m->from.c, s_from ) ) { - ++num_match ; - s_to = strlen(m->to) ; - len = strlen(ctx->buf) ; - VERBOSE( ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, ctx->f->r, - "H: matched %s, substituting %s", m->from.c, m->to) ) ; - if ( s_to > s_from ) { - preserve(ctx, s_to - s_from) ; - memmove(ctx->buf+s_to, ctx->buf+s_from, - len + 1 - s_from ) ; - memcpy(ctx->buf, m->to, s_to) ; - } else { /* it fits in the existing space */ - memcpy(ctx->buf, m->to, s_to) ; - memmove(ctx->buf+s_to, ctx->buf+s_from, - len + 1 - s_from) ; - } - break ; - } - } - if ( num_match > 0 ) /* URIs only want one match */ - break ; - } - break ; - case ATTR_EVENT: - for ( m = ctx->cfg->map ; m ; m = m->next ) { - num_match = 0 ; /* reset here since we're working per-rule */ - if ( ! ( m->flags & M_EVENTS ) ) - continue ; - if ( m->flags & M_REGEX ) { - nmatch = 10 ; - offs = 0 ; - while ( ! ap_regexec(m->from.r, ctx->buf+offs, - nmatch, pmatch, 0) ) { - match = pmatch[0].rm_so ; - s_from = pmatch[0].rm_eo - match ; - subs = ap_pregsub(ctx->f->r->pool, m->to, ctx->buf+offs, - nmatch, pmatch) ; - VERBOSE( { - const char* f = apr_pstrndup(ctx->f->r->pool, - ctx->buf + offs , s_from ) ; - ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, ctx->f->r, - "E/RX: match at %s, substituting %s", f, subs) ; - } ) - s_to = strlen(subs) ; - offs += match ; - len = strlen(ctx->buf) ; - if ( s_to > s_from) { - preserve(ctx, s_to - s_from) ; - memmove(ctx->buf+offs+s_to, ctx->buf+offs+s_from, - len + 1 - s_from - offs) ; - memcpy(ctx->buf+offs, subs, s_to) ; - } else { - memcpy(ctx->buf + offs, subs, s_to) ; - memmove(ctx->buf+offs+s_to, ctx->buf+offs+s_from, - len + 1 - s_from - offs) ; - } - offs += s_to ; - ++num_match ; - } - } else { - found = strstr(ctx->buf, m->from.c) ; - if ( (m->flags & M_ATSTART) && ( found != ctx->buf) ) - continue ; - while ( found ) { - s_from = strlen(m->from.c) ; - s_to = strlen(m->to) ; - match = found - ctx->buf ; - if ( ( s_from < strlen(found) ) && (m->flags & M_ATEND ) ) 
{ - found = strstr(ctx->buf+match+s_from, m->from.c) ; - continue ; - } else { - found = strstr(ctx->buf+match+s_to, m->from.c) ; - } - VERBOSE( ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, ctx->f->r, - "E: matched %s, substituting %s", m->from.c, m->to) ) ; - len = strlen(ctx->buf) ; - if ( s_to > s_from ) { - preserve(ctx, s_to - s_from) ; - memmove(ctx->buf+match+s_to, ctx->buf+match+s_from, - len + 1 - s_from - match) ; - memcpy(ctx->buf+match, m->to, s_to) ; - } else { - memcpy(ctx->buf+match, m->to, s_to) ; - memmove(ctx->buf+match+s_to, ctx->buf+match+s_from, - len + 1 - s_from - match) ; - } - ++num_match ; - } - } - if ( num_match && ( m->flags & M_LAST ) ) - break ; - } - break ; - case ATTR_IGNORE: - break ; - } + pappend(ctx, a[1], strlen(a[1])+1) ; + is_uri = ATTR_IGNORE ; + if ( linkattrs ) { + tattr* attrs = (tattr*) linkattrs->elts; + for (i=0; i < linkattrs->nelts; ++i) { + if ( !strcmp(*a, attrs[i].val)) { + is_uri = ATTR_URI ; + break ; + } + } + } + if ( (is_uri == ATTR_IGNORE) && ctx->cfg->extfix + && (ctx->cfg->events != NULL) ) { + for (i=0; i < ctx->cfg->events->nelts; ++i) { + tattr* attrs = (tattr*) ctx->cfg->events->elts; + if ( !strcmp(*a, attrs[i].val)) { + is_uri = ATTR_EVENT ; + break ; + } + } + } + switch ( is_uri ) { + case ATTR_URI: + num_match = 0 ; + for ( m = themap ; m ; m = m->next ) { + if ( ! ( m->flags & M_HTML ) ) + continue ; + if ( m->flags & M_REGEX ) { + nmatch = 10 ; + if ( ! 
ap_regexec(m->from.r, ctx->buf, nmatch, pmatch, 0) ) { + ++num_match ; + offs = match = pmatch[0].rm_so ; + s_from = pmatch[0].rm_eo - match ; + subs = ap_pregsub(ctx->f->r->pool, m->to, ctx->buf+offs, + nmatch, pmatch) ; + VERBOSE( { + const char* f = apr_pstrndup(ctx->f->r->pool, + ctx->buf + offs , s_from ) ; + ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, ctx->f->r, + "H/RX: match at %s, substituting %s", f, subs) ; + } ) + s_to = strlen(subs) ; + len = strlen(ctx->buf) ; + if ( s_to > s_from) { + preserve(ctx, s_to - s_from) ; + memmove(ctx->buf+offs+s_to, ctx->buf+offs+s_from, + len + 1 - s_from - offs) ; + memcpy(ctx->buf+offs, subs, s_to) ; + } else { + memcpy(ctx->buf + offs, subs, s_to) ; + memmove(ctx->buf+offs+s_to, ctx->buf+offs+s_from, + len + 1 - s_from - offs) ; + } + } + } else { + s_from = strlen(m->from.c) ; + if ( ! strncasecmp(ctx->buf, m->from.c, s_from ) ) { + ++num_match ; + s_to = strlen(m->to) ; + len = strlen(ctx->buf) ; + VERBOSE( ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, ctx->f->r, + "H: matched %s, substituting %s", m->from.c, m->to) ) ; + if ( s_to > s_from ) { + preserve(ctx, s_to - s_from) ; + memmove(ctx->buf+s_to, ctx->buf+s_from, + len + 1 - s_from ) ; + memcpy(ctx->buf, m->to, s_to) ; + } else { /* it fits in the existing space */ + memcpy(ctx->buf, m->to, s_to) ; + memmove(ctx->buf+s_to, ctx->buf+s_from, + len + 1 - s_from) ; + } + break ; + } + } + /* URIs only want one match unless overridden in the config */ + if ( (num_match > 0) && !( m->flags & M_NOTLAST ) ) + break ; + } + break ; + case ATTR_EVENT: + for ( m = themap ; m ; m = m->next ) { + num_match = 0 ; /* reset here since we're working per-rule */ + if ( ! ( m->flags & M_EVENTS ) ) + continue ; + if ( m->flags & M_REGEX ) { + nmatch = 10 ; + offs = 0 ; + while ( ! 
ap_regexec(m->from.r, ctx->buf+offs, + nmatch, pmatch, 0) ) { + match = pmatch[0].rm_so ; + s_from = pmatch[0].rm_eo - match ; + subs = ap_pregsub(ctx->f->r->pool, m->to, ctx->buf+offs, + nmatch, pmatch) ; + VERBOSE( { + const char* f = apr_pstrndup(ctx->f->r->pool, + ctx->buf + offs , s_from ) ; + ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, ctx->f->r, + "E/RX: match at %s, substituting %s", f, subs) ; + } ) + s_to = strlen(subs) ; + offs += match ; + len = strlen(ctx->buf) ; + if ( s_to > s_from) { + preserve(ctx, s_to - s_from) ; + memmove(ctx->buf+offs+s_to, ctx->buf+offs+s_from, + len + 1 - s_from - offs) ; + memcpy(ctx->buf+offs, subs, s_to) ; + } else { + memcpy(ctx->buf + offs, subs, s_to) ; + memmove(ctx->buf+offs+s_to, ctx->buf+offs+s_from, + len + 1 - s_from - offs) ; + } + offs += s_to ; + ++num_match ; + } + } else { + found = strstr(ctx->buf, m->from.c) ; + if ( (m->flags & M_ATSTART) && ( found != ctx->buf) ) + continue ; + while ( found ) { + s_from = strlen(m->from.c) ; + s_to = strlen(m->to) ; + match = found - ctx->buf ; + if ( ( s_from < strlen(found) ) && (m->flags & M_ATEND ) ) { + found = strstr(ctx->buf+match+s_from, m->from.c) ; + continue ; + } else { + found = strstr(ctx->buf+match+s_to, m->from.c) ; + } + VERBOSE( ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, ctx->f->r, + "E: matched %s, substituting %s", m->from.c, m->to) ) ; + len = strlen(ctx->buf) ; + if ( s_to > s_from ) { + preserve(ctx, s_to - s_from) ; + memmove(ctx->buf+match+s_to, ctx->buf+match+s_from, + len + 1 - s_from - match) ; + memcpy(ctx->buf+match, m->to, s_to) ; + } else { + memcpy(ctx->buf+match, m->to, s_to) ; + memmove(ctx->buf+match+s_to, ctx->buf+match+s_from, + len + 1 - s_from - match) ; + } + ++num_match ; + } + } + if ( num_match && ( m->flags & M_LAST ) ) + break ; + } + break ; + case ATTR_IGNORE: + break ; + } } if ( ! 
a[1] ) - ap_fputstrs(ctx->f->next, ctx->bb, " ", a[0], NULL) ; + ap_fputstrs(ctx->f->next, ctx->bb, " ", a[0], NULL) ; else { - if ( ctx->cfg->flags != 0 ) - normalise(ctx->cfg->flags, ctx->buf) ; + if ( ctx->cfg->flags != 0 ) + normalise(ctx->cfg->flags, ctx->buf) ; - /* write the attribute, using pcharacters to html-escape - anything that needs it in the value. - */ - ap_fputstrs(ctx->f->next, ctx->bb, " ", a[0], "=\"", NULL) ; - pcharacters(ctx, ctx->buf, strlen(ctx->buf)) ; - ap_fputc(ctx->f->next, ctx->bb, '"') ; + /* write the attribute, using pcharacters to html-escape + anything that needs it in the value. + */ + ap_fputstrs(ctx->f->next, ctx->bb, " ", a[0], "=\"", NULL) ; + pcharacters(ctx, (const xmlChar*)ctx->buf, strlen(ctx->buf)) ; + ap_fputc(ctx->f->next, ctx->bb, '"') ; } } } ctx->offset = 0 ; - if ( is_empty_elt(name) ) + if ( desc && desc->empty ) ap_fputs(ctx->f->next, ctx->bb, ctx->cfg->etag) ; else ap_fputc(ctx->f->next, ctx->bb, '>') ; -} -static htmlSAXHandlerPtr setupSAX(apr_pool_t* pool) { - htmlSAXHandlerPtr sax = apr_pcalloc(pool, sizeof(htmlSAXHandler) ) ; - sax->startDocument = NULL ; - sax->endDocument = NULL ; - sax->startElement = pstartElement ; - sax->endElement = pendElement ; - sax->characters = pcharacters ; - sax->comment = pcomment ; - sax->cdataBlock = pcdata ; - return sax ; + + if ((enforce > 0) && (required_attrs > 0)) { + /* if there are more required attributes than we found then complain */ + ap_log_rerror(APLOG_MARK, APLOG_DEBUG, 0, ctx->f->r, + "HTML element %s is missing %d required attributes", + name, required_attrs); + } } +/* globals set once at startup */ static ap_regex_t* seek_meta_ctype ; static ap_regex_t* seek_charset ; static ap_regex_t* seek_meta ; -static void proxy_html_child_init(apr_pool_t* pool, server_rec* s) { - seek_meta_ctype = ap_pregcomp(pool, - "(]*http-equiv[ \t\r\n='\"]*content-type[^>]*>)", - AP_REG_EXTENDED|AP_REG_ICASE) ; - seek_charset = ap_pregcomp(pool, "charset=([A-Za-z0-9_-]+)", - 
AP_REG_EXTENDED|AP_REG_ICASE) ; - seek_meta = ap_pregcomp(pool, "]*(http-equiv)[^>]*>", - AP_REG_EXTENDED|AP_REG_ICASE) ; -} - -static xmlCharEncoding sniff_encoding( - request_rec* r, const char* cbuf, size_t bytes +static xmlCharEncoding sniff_encoding(saxctxt* ctx, const char* cbuf, + size_t bytes) { #ifndef GO_FASTER - , int verbose + int verbose = ctx->cfg->verbose; #endif - ) { + request_rec* r = ctx->f->r ; + proxy_html_conf* cfg = ctx->cfg ; xmlCharEncoding ret ; - char* encoding = NULL ; char* p ; ap_regmatch_t match[2] ; - unsigned char* buf = (unsigned char*)cbuf ; + char* buf = (char*)cbuf ; + apr_xlate_t* convset; VERBOSE( ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r, - "Content-Type is %s", r->content_type) ) ; + "Content-Type is %s", r->content_type) ) ; /* If we've got it in the HTTP headers, there's nothing to do */ if ( r->content_type && - ( p = ap_strcasestr(r->content_type, "charset=") , p > 0 ) ) { + ( p = ap_strcasestr(r->content_type, "charset=") , p > 0 ) ) { p += 8 ; - if ( encoding = apr_pstrndup(r->pool, p, strcspn(p, " ;") ) , encoding ) { - if ( ret = xmlParseCharEncoding(encoding), - ret != XML_CHAR_ENCODING_ERROR ) { - VERBOSE( ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r, - "Got charset %s from HTTP headers", encoding) ) ; - return ret ; - } else { - ap_log_rerror(APLOG_MARK, APLOG_ERR, 0, r, - "Unsupported charset %s in HTTP headers", encoding) ; - encoding = NULL ; + if ( ctx->encoding = apr_pstrndup(r->pool, p, strcspn(p, " ;") ) , + ctx->encoding ) { + VERBOSE( ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r, + "Got charset %s from HTTP headers", ctx->encoding) ) ; + if ( ret = xmlParseCharEncoding(ctx->encoding), + ((ret != XML_CHAR_ENCODING_ERROR ) + && (ret != XML_CHAR_ENCODING_NONE))) { + return ret ; } } } /* to sniff, first we look for BOM */ - if ( ret = xmlDetectCharEncoding(buf, bytes), - ret != XML_CHAR_ENCODING_NONE ) { - VERBOSE( ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r, - "Got charset from XML rules.") ) ; - return ret ; - 
} + if (ctx->encoding == NULL) { + if ( ret = xmlDetectCharEncoding((const xmlChar*)buf, bytes), + ret != XML_CHAR_ENCODING_NONE ) { + VERBOSE( ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r, + "Got charset from XML rules.") ) ; + return ret ; + } /* If none of the above, look for a META-thingey */ - encoding = NULL ; - if ( ap_regexec(seek_meta_ctype, buf, 1, match, 0) == 0 ) { - p = apr_pstrndup(r->pool, buf + match[0].rm_so, - match[0].rm_eo - match[0].rm_so) ; - if ( ap_regexec(seek_charset, p, 2, match, 0) == 0 ) - encoding = apr_pstrndup(r->pool, p+match[1].rm_so, - match[1].rm_eo - match[1].rm_so) ; + if ( ap_regexec(seek_meta_ctype, buf, 1, match, 0) == 0 ) { + p = apr_pstrndup(r->pool, buf + match[0].rm_so, + match[0].rm_eo - match[0].rm_so) ; + if ( ap_regexec(seek_charset, p, 2, match, 0) == 0 ) + ctx->encoding = apr_pstrndup(r->pool, p+match[1].rm_so, + match[1].rm_eo - match[1].rm_so) ; + } } /* either it's set to something we found or it's still the default */ - if ( encoding ) { - if ( ret = xmlParseCharEncoding(encoding), - ret != XML_CHAR_ENCODING_ERROR ) { - VERBOSE( ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r, - "Got charset %s from HTML META", encoding) ) ; + if ( ctx->encoding ) { + VERBOSE( ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r, + "Got charset %s from HTML META", ctx->encoding) ) ; + if ( ret = xmlParseCharEncoding(ctx->encoding), + ((ret != XML_CHAR_ENCODING_ERROR ) + && (ret != XML_CHAR_ENCODING_NONE))) { return ret ; + } +/* Unsupported charset. Can we get (iconv) support through apr_xlate? */ +/* Aaargh! libxml2 has undocumented support. So this fails + * if metafix is not active. Have to make it conditional. 
+ */ + if (cfg->metafix) { + VERBOSE( ap_log_rerror(APLOG_MARK, APLOG_DEBUG, 0, r, + "Charset %s not supported by libxml2; trying apr_xlate", ctx->encoding) ) ; + if (apr_xlate_open(&convset, "UTF-8", ctx->encoding, r->pool) == APR_SUCCESS) { + ctx->conv_in = apr_pcalloc(r->pool, sizeof(conv_t)); + ctx->conv_in->convset = convset ; + return XML_CHAR_ENCODING_UTF8 ; + } else { + ap_log_rerror(APLOG_MARK, APLOG_ERR, 0, r, + "Charset %s not supported. Consider aliasing it?", ctx->encoding) ; + } } else { ap_log_rerror(APLOG_MARK, APLOG_ERR, 0, r, - "Unsupported charset %s in HTML META", encoding) ; + "Charset %s not supported. Consider aliasing it or use metafix?", + ctx->encoding) ; } } -/* the old HTTP default is a last resort */ + + +/* Use configuration default as a last resort */ ap_log_rerror(APLOG_MARK, APLOG_WARNING, 0, r, - "No usable charset information: using old HTTP default LATIN1") ; - return XML_CHAR_ENCODING_8859_1 ; + "No usable charset information; using configuration default") ; + return (cfg->default_encoding == XML_CHAR_ENCODING_NONE) + ? XML_CHAR_ENCODING_8859_1 : cfg->default_encoding ; } static meta* metafix(request_rec* r, const char* buf /*, size_t bytes*/ #ifndef GO_FASTER - , int verbose + , int verbose #endif - ) { + ) { meta* ret = NULL ; size_t offs = 0 ; const char* p ; @@ -700,20 +877,20 @@ static meta* metafix(request_rec* r, const char* buf /*, size_t bytes*/ if ( strncasecmp(header, "Content-", 8) ) { /* find content=... 
string */ for ( p = ap_strstr((char*)buf+offs+pmatch[0].rm_so, "content") ; *p ; ) { - p += 7 ; - while ( *p && isspace(*p) ) - ++p ; - if ( *p != '=' ) - continue ; - while ( *p && isspace(*++p) ) ; - if ( ( *p == '\'' ) || ( *p == '"' ) ) { - delim = *p++ ; - for ( q = p ; *q != delim ; ++q ) ; - } else { - for ( q = p ; *q && !isspace(*q) && (*q != '>') ; ++q ) ; - } - content = apr_pstrndup(r->pool, p, q-p) ; - break ; + p += 7 ; + while ( *p && isspace(*p) ) + ++p ; + if ( *p != '=' ) + continue ; + while ( *p && isspace(*++p) ) ; + if ( ( *p == '\'' ) || ( *p == '"' ) ) { + delim = *p++ ; + for ( q = p ; *q != delim ; ++q ) ; + } else { + for ( q = p ; *q && !isspace(*q) && (*q != '>') ; ++q ) ; + } + content = apr_pstrndup(r->pool, p, q-p) ; + break ; } } else if ( !strncasecmp(header, "Content-Type", 12) ) { ret = apr_palloc(r->pool, sizeof(meta) ) ; @@ -722,7 +899,7 @@ static meta* metafix(request_rec* r, const char* buf /*, size_t bytes*/ } if ( header && content ) { VERBOSE( ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r, - "Adding header [%s: %s] from HTML META", header, content) ) ; + "Adding header [%s: %s] from HTML META", header, content) ) ; apr_table_setn(r->headers_out, header, content) ; } offs += pmatch[0].rm_eo ; @@ -730,57 +907,121 @@ static meta* metafix(request_rec* r, const char* buf /*, size_t bytes*/ return ret ; } -static int proxy_html_filter_init(ap_filter_t* f) { - const char* env ; - saxctxt* fctx ; +static const char* interpolate_vars(request_rec* r, const char* str) { + const char* start; + const char* end; + const char* delim; + const char* before; + const char* after; + const char* replacement; + const char* var; + for (;;) { + start = str ; + if (start = ap_strstr_c(start, "${"), start == NULL) + break; -#if 0 -/* remove content-length filter */ - ap_filter_rec_t* clf = ap_get_output_filter_handle("CONTENT_LENGTH") ; - ap_filter_t* ff = f->next ; - - do { - ap_filter_t* fnext = ff->next ; - if ( ff->frec == clf ) - 
ap_remove_output_filter(ff) ; - ff = fnext ; - } while ( ff ) ; -#endif + if (end = ap_strchr_c(start+2, '}'), end == NULL) + break; - fctx = f->ctx = apr_pcalloc(f->r->pool, sizeof(saxctxt)) ; - fctx->sax = setupSAX(f->r->pool) ; - fctx->f = f ; - fctx->bb = apr_brigade_create(f->r->pool, f->r->connection->bucket_alloc) ; - fctx->cfg = ap_get_module_config(f->r->per_dir_config,&proxy_html_module); - - if ( f->r->proto_num >= 1001 ) { - if ( ! f->r->main && ! f->r->prev ) { - env = apr_table_get(f->r->subprocess_env, "force-response-1.0") ; - if ( !env ) - f->r->chunked = 1 ; + delim = ap_strchr_c(start, '|'); + before = apr_pstrndup(r->pool, str, start-str); + after = end+1; + if (delim) { + var = apr_pstrndup(r->pool, start+2, delim-start-2) ; + } else { + var = apr_pstrndup(r->pool, start+2, end-start-2) ; } + replacement = apr_table_get(r->subprocess_env, var) ; + if (!replacement) + if (delim) + replacement = apr_pstrndup(r->pool, delim+1, end-delim-1); + else + replacement = ""; + str = apr_pstrcat(r->pool, before, replacement, after, NULL); + ap_log_rerror(APLOG_MARK, APLOG_DEBUG, 0, r, + "Interpolating %s => %s", var, replacement) ; } + return str; +} +static void fixup_rules(saxctxt* ctx) { + const char* thisval; + urlmap* newp; + urlmap* p; + urlmap* prev = NULL; + request_rec* r = ctx->f->r; + int has_cond; - apr_table_unset(f->r->headers_out, "Content-Length") ; - apr_table_unset(f->r->headers_out, "ETag") ; - return OK ; + for (p = ctx->cfg->map; p; p = p->next) { + has_cond = -1; + if (p->cond != NULL) { + thisval = apr_table_get(r->subprocess_env, p->cond->env); + if (!p->cond->val) { + /* required to be "anything" */ + if (thisval) + has_cond = 1; /* satisfied */ + else + has_cond = 0; /* unsatisfied */ + } else { + if (thisval && !strcasecmp(p->cond->val, thisval)) { + has_cond = 1; /* satisfied */ + } else { + has_cond = 0; /* unsatisfied */ + } + } + if (((has_cond == 0) && (p->cond->rel ==1 )) + || ((has_cond == 1) && (p->cond->rel == -1))) { + 
continue; /* condition is unsatisfied */ + } + } + + newp = apr_pmemdup(r->pool, p, sizeof(urlmap)); + + if (newp->flags & M_INTERPOLATE_FROM) { + newp->from.c = interpolate_vars(r, newp->from.c); + if (!newp->from.c || !*newp->from.c) + continue; /* don't use empty from-pattern */ + if (newp->flags & M_REGEX) { + newp->from.r = ap_pregcomp(r->pool, newp->from.c, newp->regflags) ; + } + } + if (newp->flags & M_INTERPOLATE_TO) { + newp->to = interpolate_vars(r, newp->to); + } + /* evaluate p->cond; continue if unsatisfied */ + /* create new urlmap with memcpy and append to map */ + /* interpolate from if flagged to do so */ + /* interpolate to if flagged to do so */ + + if (prev != NULL) + prev->next = newp ; + else + ctx->map = newp ; + prev = newp ; + } + + if (prev) + prev->next = NULL; } static saxctxt* check_filter_init (ap_filter_t* f) { + saxctxt* fctx ; + proxy_html_conf* cfg + = ap_get_module_config(f->r->per_dir_config, &proxy_html_module); + const char* force = apr_table_get(f->r->subprocess_env, "PROXY_HTML_FORCE"); const char* errmsg = NULL ; - if ( ! f->r->proxyreq ) { - errmsg = "Non-proxy request; not inserting proxy-html filter" ; - } else if ( ! f->r->content_type ) { - errmsg = "No content-type; bailing out of proxy-html filter" ; - } else if ( strncasecmp(f->r->content_type, "text/html", 9) && - strncasecmp(f->r->content_type, "application/xhtml+xml", 21) ) { - errmsg = "Non-HTML content; not inserting proxy-html filter" ; + if ( !force ) { + if ( ! f->r->proxyreq ) { + errmsg = "Non-proxy request; not inserting proxy-html filter" ; + } else if ( ! 
f->r->content_type ) { + errmsg = "No content-type; bailing out of proxy-html filter" ; + } else if ( strncasecmp(f->r->content_type, "text/html", 9) && + strncasecmp(f->r->content_type, "application/xhtml+xml", 21) ) { + errmsg = "Non-HTML content; not inserting proxy-html filter" ; + } } if ( errmsg ) { #ifndef GO_FASTER - proxy_html_conf* cfg - = ap_get_module_config(f->r->per_dir_config, &proxy_html_module); if ( cfg->verbose ) { ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, f->r, errmsg) ; } @@ -788,11 +1029,27 @@ static saxctxt* check_filter_init (ap_filter_t* f) { ap_remove_output_filter(f) ; return NULL ; } - if ( ! f->ctx ) - proxy_html_filter_init(f) ; + + if ( ! f->ctx) { + fctx = f->ctx = apr_pcalloc(f->r->pool, sizeof(saxctxt)) ; + fctx->f = f ; + fctx->bb = apr_brigade_create(f->r->pool, f->r->connection->bucket_alloc) ; + fctx->cfg = cfg; + apr_table_unset(f->r->headers_out, "Content-Length") ; + + if (cfg->interp) + fixup_rules(fctx); + else + fctx->map = cfg->map; + /* defer dealing with charset_out until after sniffing charset_in + * so we can support setting one to t'other. + */ + } return f->ctx ; } static int proxy_html_filter(ap_filter_t* f, apr_bucket_brigade* bb) { + apr_xlate_t* convset; + const char* charset = NULL; apr_bucket* b ; meta* m = NULL ; xmlCharEncoding enc ; @@ -800,90 +1057,152 @@ static int proxy_html_filter(ap_filter_t* f, apr_bucket_brigade* bb) { apr_size_t bytes = 0 ; #ifndef USE_OLD_LIBXML2 int xmlopts = XML_PARSE_RECOVER | XML_PARSE_NONET | - XML_PARSE_NOBLANKS | XML_PARSE_NOERROR | XML_PARSE_NOWARNING ; + XML_PARSE_NOBLANKS | XML_PARSE_NOERROR | XML_PARSE_NOWARNING ; #endif saxctxt* ctxt = check_filter_init(f) ; +#ifndef GO_FASTER + int verbose; +#endif if ( ! 
ctxt ) return ap_pass_brigade(f->next, bb) ; +#ifndef GO_FASTER + verbose = ctxt->cfg->verbose; +#endif for ( b = APR_BRIGADE_FIRST(bb) ; - b != APR_BRIGADE_SENTINEL(bb) ; - b = APR_BUCKET_NEXT(b) ) { - if ( APR_BUCKET_IS_EOS(b) ) { - if ( ctxt->parser != NULL ) { - htmlParseChunk(ctxt->parser, buf, 0, 1) ; + b != APR_BRIGADE_SENTINEL(bb) ; + b = APR_BUCKET_NEXT(b) ) { + if ( APR_BUCKET_IS_METADATA(b) ) { + if ( APR_BUCKET_IS_EOS(b) ) { + if ( ctxt->parser != NULL ) { + consume_buffer(ctxt, buf, 0, 1); + } + APR_BRIGADE_INSERT_TAIL(ctxt->bb, + apr_bucket_eos_create(ctxt->bb->bucket_alloc) ) ; + ap_pass_brigade(ctxt->f->next, ctxt->bb) ; + } else if ( APR_BUCKET_IS_FLUSH(b) ) { + /* pass on flush, except at start where it would cause + * headers to be sent before doc sniffing + */ + if ( ctxt->parser != NULL ) { + ap_fflush(ctxt->f->next, ctxt->bb) ; + } } - APR_BRIGADE_INSERT_TAIL(ctxt->bb, - apr_bucket_eos_create(ctxt->bb->bucket_alloc) ) ; - ap_pass_brigade(ctxt->f->next, ctxt->bb) ; - } else if ( ! APR_BUCKET_IS_METADATA(b) && - apr_bucket_read(b, &buf, &bytes, APR_BLOCK_READ) - == APR_SUCCESS ) { + } else if ( apr_bucket_read(b, &buf, &bytes, APR_BLOCK_READ) + == APR_SUCCESS ) { if ( ctxt->parser == NULL ) { - if ( buf && buf[bytes] != 0 ) { - /* make a string for parse routines to play with */ - char* buf1 = apr_palloc(f->r->pool, bytes+1) ; - memcpy(buf1, buf, bytes) ; - buf1[bytes] = 0 ; - buf = buf1 ; + if ( buf[bytes] != 0 ) { + /* make a string for parse routines to play with */ + char* buf1 = apr_palloc(f->r->pool, bytes+1) ; + memcpy(buf1, buf, bytes) ; + buf1[bytes] = 0 ; + buf = buf1 ; + } + /* For publishing systems that insert crap at the head of a + * page that buggers up the parser. Search to first instance + * of some relatively sane, or at least parseable, element. 
+ */ + if (ctxt->cfg->skipto != NULL) { + char* p = ap_strchr_c(buf, '<'); + tattr* starts = (tattr*) ctxt->cfg->skipto->elts; + int found = 0; + while (!found && *p) { + int i; + for (i = 0; i < ctxt->cfg->skipto->nelts; ++i) { + if ( !strncasecmp(p+1, starts[i].val, strlen(starts[i].val))) { + bytes -= (p-buf); + buf = p ; + found = 1; + VERBOSE( + ap_log_rerror(APLOG_MARK, APLOG_DEBUG, 0, f->r, + "Skipped to first <%s> element", starts[i].val) + ) ; + break; + } + } + p = ap_strchr_c(p+1, '<'); + } + if (p == NULL) { + ap_log_rerror(APLOG_MARK, APLOG_WARNING, 0, f->r, + "Failed to find start of recognised HTML!") ; + } + } + + enc = sniff_encoding(ctxt, buf, bytes) ; + /* now we have input charset, set output charset too */ + if (ctxt->cfg->charset_out) { + if (!strcmp(ctxt->cfg->charset_out, "*")) + charset = ctxt->encoding; + else + charset = ctxt->cfg->charset_out; + if (strcasecmp(charset, "utf-8")) { + if (apr_xlate_open(&convset, charset, "UTF-8", + f->r->pool) == APR_SUCCESS) { + ctxt->conv_out = apr_pcalloc(f->r->pool, sizeof(conv_t)); + ctxt->conv_out->convset = convset; + } else { + ap_log_rerror(APLOG_MARK, APLOG_WARNING, 0, f->r, + "Output charset %s not supported. 
Falling back to UTF-8", + charset) ; + } + } } + if (ctxt->conv_out) { + const char* ctype = apr_psprintf(f->r->pool, + "text/html;charset=%s", charset); + ap_set_content_type(f->r, ctype) ; + } else { + ap_set_content_type(f->r, "text/html;charset=utf-8") ; + } + ap_fputs(f->next, ctxt->bb, ctxt->cfg->doctype) ; + ctxt->parser = htmlCreatePushParserCtxt(&sax, ctxt, buf, 4, 0, enc) ; + buf += 4; + bytes -= 4; + if (ctxt->parser == NULL) { + apr_status_t rv = ap_pass_brigade(f->next, bb) ; + ap_remove_output_filter(f) ; + return rv; + } + apr_pool_cleanup_register(f->r->pool, ctxt->parser, + (void*)htmlFreeParserCtxt, apr_pool_cleanup_null) ; +#ifndef USE_OLD_LIBXML2 + if ( xmlopts = xmlCtxtUseOptions(ctxt->parser, xmlopts ), xmlopts ) + ap_log_rerror(APLOG_MARK, APLOG_WARNING, 0, f->r, + "Unsupported parser opts %x", xmlopts) ; +#endif + if ( ctxt->cfg->metafix ) #ifndef GO_FASTER - enc = sniff_encoding(f->r, buf, bytes, ctxt->cfg->verbose) ; - if ( ctxt->cfg->metafix ) - m = metafix(f->r, buf, ctxt->cfg->verbose) ; + m = metafix(f->r, buf, ctxt->cfg->verbose) ; #else - enc = sniff_encoding(f->r, buf, bytes) ; - if ( ctxt->cfg->metafix ) - m = metafix(f->r, buf) ; -#endif - ap_set_content_type(f->r, "text/html;charset=utf-8") ; - ap_fputs(f->next, ctxt->bb, ctxt->cfg->doctype) ; - if ( m ) { - ctxt->parser = htmlCreatePushParserCtxt(ctxt->sax, ctxt, - buf, m->start, 0, enc ) ; - htmlParseChunk(ctxt->parser, buf+m->end, bytes-m->end, 0) ; - } else { - ctxt->parser = htmlCreatePushParserCtxt(ctxt->sax, ctxt, - buf, bytes, 0, enc ) ; - } - apr_pool_cleanup_register(f->r->pool, ctxt->parser, - (void*)htmlFreeParserCtxt, apr_pool_cleanup_null) ; -#ifndef USE_OLD_LIBXML2 - if ( xmlopts = xmlCtxtUseOptions(ctxt->parser, xmlopts ), xmlopts ) - ap_log_rerror(APLOG_MARK, APLOG_WARNING, 0, f->r, - "Unsupported parser opts %x", xmlopts) ; + m = metafix(f->r, buf) ; #endif + if ( m ) { + consume_buffer(ctxt, buf, m->start, 0) ; + consume_buffer(ctxt, buf+m->end, bytes-m->end, 
0) ; + } else { + consume_buffer(ctxt, buf, bytes, 0) ; + } } else { - htmlParseChunk(ctxt->parser, buf, bytes, 0) ; + consume_buffer(ctxt, buf, bytes, 0) ; } } else { ap_log_rerror(APLOG_MARK, APLOG_ERR, 0, f->r, "Error in bucket read") ; } } - /*ap_fflush(ctxt->f->next, ctxt->bb) ; // uncomment for debug */ + /*ap_fflush(ctxt->f->next, ctxt->bb) ; // uncomment for debug */ apr_brigade_cleanup(bb) ; return APR_SUCCESS ; } -static const char* fpi_html = - "\n" ; -static const char* fpi_html_legacy = - "\n" ; -static const char* fpi_xhtml = - "\n" ; -static const char* fpi_xhtml_legacy = - "\n" ; -static const char* html_etag = ">" ; -static const char* xhtml_etag = " />" ; -/*#define DEFAULT_DOCTYPE fpi_html */ -static const char* DEFAULT_DOCTYPE = "" ; -#define DEFAULT_ETAG html_etag static void* proxy_html_config(apr_pool_t* pool, char* x) { proxy_html_conf* ret = apr_pcalloc(pool, sizeof(proxy_html_conf) ) ; ret->doctype = DEFAULT_DOCTYPE ; ret->etag = DEFAULT_ETAG ; ret->bufsz = 8192 ; + ret->default_encoding = XML_CHAR_ENCODING_NONE ; + /* ret->interp = 1; */ + /* don't initialise links and events until they get set/used */ return ret ; } static void* proxy_html_merge(apr_pool_t* pool, void* BASE, void* ADD) { @@ -891,6 +1210,15 @@ static void* proxy_html_merge(apr_pool_t* pool, void* BASE, void* ADD) { proxy_html_conf* add = (proxy_html_conf*) ADD ; proxy_html_conf* conf = apr_palloc(pool, sizeof(proxy_html_conf)) ; + /* don't merge declarations - just use the most specific */ + conf->links = (add->links == NULL) ? base->links : add->links; + conf->events = (add->events == NULL) ? base->events : add->events; + + conf->default_encoding = (add->default_encoding == XML_CHAR_ENCODING_NONE) + ? base->default_encoding : add->default_encoding ; + conf->charset_out = (add->charset_out == NULL) + ? 
base->charset_out : add->charset_out ; + if ( add->map && base->map ) { urlmap* a ; conf->map = NULL ; @@ -908,14 +1236,16 @@ static void* proxy_html_merge(apr_pool_t* pool, void* BASE, void* ADD) { conf->map = add->map ? add->map : base->map ; conf->doctype = ( add->doctype == DEFAULT_DOCTYPE ) - ? base->doctype : add->doctype ; + ? base->doctype : add->doctype ; conf->etag = ( add->etag == DEFAULT_ETAG ) ? base->etag : add->etag ; conf->bufsz = add->bufsz ; if ( add->flags & NORM_RESET ) { conf->flags = add->flags ^ NORM_RESET ; conf->metafix = add->metafix ; conf->extfix = add->extfix ; + conf->interp = add->interp ; conf->strip_comments = add->strip_comments ; + conf->skipto = add->skipto ; #ifndef GO_FASTER conf->verbose = add->verbose ; #endif @@ -923,56 +1253,101 @@ static void* proxy_html_merge(apr_pool_t* pool, void* BASE, void* ADD) { conf->flags = base->flags | add->flags ; conf->metafix = base->metafix | add->metafix ; conf->extfix = base->extfix | add->extfix ; + conf->interp = base->interp | add->interp ; conf->strip_comments = base->strip_comments | add->strip_comments ; + conf->skipto = add->skipto ? add->skipto : base->skipto ; #ifndef GO_FASTER conf->verbose = base->verbose | add->verbose ; #endif } return conf ; } -#define REGFLAG(n,s,c) ( (s&&(ap_strchr((char*)(s),(c))!=NULL)) ? (n) : 0 ) -#define XREGFLAG(n,s,c) ( (!s||(ap_strchr((char*)(s),(c))==NULL)) ? (n) : 0 ) -static const char* set_urlmap(cmd_parms* cmd, void* CFG, - const char* from, const char* to, const char* flags) { - int regflags ; - proxy_html_conf* cfg = (proxy_html_conf*)CFG ; - urlmap* map ; - urlmap* newmap = apr_palloc(cmd->pool, sizeof(urlmap) ) ; - - newmap->next = NULL ; +#define REGFLAG(n,s,c) ( (s&&(ap_strchr_c((s),(c))!=NULL)) ? (n) : 0 ) +#define XREGFLAG(n,s,c) ( (!s||(ap_strchr_c((s),(c))==NULL)) ? 
(n) : 0 ) +static void comp_urlmap(apr_pool_t* pool, urlmap* newmap, + const char* from, const char* to, const char* flags, const char* cond) { + char* eq; newmap->flags - = XREGFLAG(M_HTML,flags,'h') - | XREGFLAG(M_EVENTS,flags,'e') - | XREGFLAG(M_CDATA,flags,'c') - | REGFLAG(M_ATSTART,flags,'^') - | REGFLAG(M_ATEND,flags,'$') - | REGFLAG(M_REGEX,flags,'R') - | REGFLAG(M_LAST,flags,'L') + = XREGFLAG(M_HTML,flags,'h') + | XREGFLAG(M_EVENTS,flags,'e') + | XREGFLAG(M_CDATA,flags,'c') + | REGFLAG(M_ATSTART,flags,'^') + | REGFLAG(M_ATEND,flags,'$') + | REGFLAG(M_REGEX,flags,'R') + | REGFLAG(M_LAST,flags,'L') + | REGFLAG(M_NOTLAST,flags,'l') + | REGFLAG(M_INTERPOLATE_TO,flags,'V') + | REGFLAG(M_INTERPOLATE_FROM,flags,'v') ; + if ( ( newmap->flags & M_INTERPOLATE_FROM) + || ! (newmap->flags & M_REGEX) ) { + newmap->from.c = from ; + newmap->to = to ; + } else { + newmap->regflags + = REGFLAG(AP_REG_EXTENDED,flags,'x') + | REGFLAG(AP_REG_ICASE,flags,'i') + | REGFLAG(AP_REG_NOSUB,flags,'n') + | REGFLAG(AP_REG_NEWLINE,flags,'s') + ; + newmap->from.r = ap_pregcomp(pool, from, newmap->regflags) ; + newmap->to = to ; + } + if (cond != NULL) { + newmap->cond = apr_pcalloc(pool, sizeof(rewritecond)); + if (cond[0] == '!') { + newmap->cond->rel = -1; + newmap->cond->env = cond+1; + } else { + newmap->cond->rel = 1; + newmap->cond->env = cond; + } + eq = ap_strchr_c(++cond, '='); + if (eq && (eq != cond)) { + *eq = 0; + newmap->cond->val = eq+1; + } + } else { + newmap->cond = NULL; + } +} +static const char* set_urlmap(cmd_parms* cmd, void* CFG, const char* args) { + proxy_html_conf* cfg = (proxy_html_conf*)CFG ; + urlmap* map ; + apr_pool_t* pool = cmd->pool; + urlmap* newmap ; + const char* usage = + "Usage: ProxyHTMLURLMap from-pattern to-pattern [flags] [cond]"; + const char* from; + const char* to; + const char* flags; + const char* cond = NULL; + + if (from = ap_getword_conf(cmd->pool, &args), !from) + return usage; + if (to = ap_getword_conf(cmd->pool, &args), !to) + 
return usage; + flags = ap_getword_conf(cmd->pool, &args); + if (flags && *flags) + cond = ap_getword_conf(cmd->pool, &args); + if (cond && !*cond) + cond = NULL; + /* the args look OK, so let's use them */ + newmap = apr_palloc(pool, sizeof(urlmap) ) ; + newmap->next = NULL; if ( cfg->map ) { for ( map = cfg->map ; map->next ; map = map->next ) ; map->next = newmap ; } else cfg->map = newmap ; - if ( ! (newmap->flags & M_REGEX) ) { - newmap->from.c = apr_pstrdup(cmd->pool, from) ; - newmap->to = apr_pstrdup(cmd->pool, to) ; - } else { - regflags - = REGFLAG(AP_REG_EXTENDED,flags,'x') - | REGFLAG(AP_REG_ICASE,flags,'i') - | REGFLAG(AP_REG_NOSUB,flags,'n') - | REGFLAG(AP_REG_NEWLINE,flags,'s') - ; - newmap->from.r = ap_pregcomp(cmd->pool, from, regflags) ; - newmap->to = apr_pstrdup(cmd->pool, to) ; - } - return NULL ; + comp_urlmap(cmd->pool, newmap, from, to, flags, cond); + return NULL; } + static const char* set_doctype(cmd_parms* cmd, void* CFG, const char* t, - const char* l) { + const char* l) { proxy_html_conf* cfg = (proxy_html_conf*)CFG ; if ( !strcasecmp(t, "xhtml") ) { cfg->etag = xhtml_etag ; @@ -995,7 +1370,8 @@ static const char* set_doctype(cmd_parms* cmd, void* CFG, const char* t, } return NULL ; } -static void set_param(proxy_html_conf* cfg, const char* arg) { +static const char* set_flags(cmd_parms* cmd, void* CFG, const char* arg) { + proxy_html_conf* cfg = CFG; if ( arg && *arg ) { if ( !strcmp(arg, "lowercase") ) cfg->flags |= NORM_LC ; @@ -1004,58 +1380,141 @@ static void set_param(proxy_html_conf* cfg, const char* arg) { else if ( !strcmp(arg, "reset") ) cfg->flags |= NORM_RESET ; } + return NULL ; } -static const char* set_flags(cmd_parms* cmd, void* CFG, const char* arg1, - const char* arg2, const char* arg3) { - set_param( (proxy_html_conf*)CFG, arg1) ; - set_param( (proxy_html_conf*)CFG, arg2) ; - set_param( (proxy_html_conf*)CFG, arg3) ; +static const char* set_events(cmd_parms* cmd, void* CFG, const char* arg) { + tattr* attr; + 
proxy_html_conf* cfg = CFG; + if (cfg->events == NULL) + cfg->events = apr_array_make(cmd->pool, 20, sizeof(tattr)); + attr = apr_array_push(cfg->events) ; + attr->val = arg; return NULL ; } +static const char* set_skipto(cmd_parms* cmd, void* CFG, const char* arg) { + tattr* attr; + proxy_html_conf* cfg = CFG; + if (cfg->skipto == NULL) + cfg->skipto = apr_array_make(cmd->pool, 4, sizeof(tattr)); + attr = apr_array_push(cfg->skipto) ; + attr->val = arg; + return NULL ; +} +static const char* set_links(cmd_parms* cmd, void* CFG, + const char* elt, const char* att) { + apr_array_header_t* attrs; + tattr* attr ; + proxy_html_conf* cfg = CFG; + + if (cfg->links == NULL) + cfg->links = apr_hash_make(cmd->pool); + + attrs = apr_hash_get(cfg->links, elt, APR_HASH_KEY_STRING) ; + if (!attrs) { + attrs = apr_array_make(cmd->pool, 2, sizeof(tattr*)) ; + apr_hash_set(cfg->links, elt, APR_HASH_KEY_STRING, attrs) ; + } + attr = apr_array_push(attrs) ; + attr->val = att ; + return NULL ; +} +static const char* set_charset_alias(cmd_parms* cmd, void* CFG, + const char* charset, const char* alias) { + const char* errmsg = ap_check_cmd_context(cmd, GLOBAL_ONLY); + if (errmsg != NULL) + return errmsg ; + else if (xmlAddEncodingAlias(charset, alias) == 0) + return NULL; + else + return "Error setting charset alias"; +} +static const char* set_charset_default(cmd_parms* cmd, void* CFG, + const char* charset) { + proxy_html_conf* cfg = CFG; + cfg->default_encoding = xmlParseCharEncoding(charset); + switch(cfg->default_encoding) { + case XML_CHAR_ENCODING_NONE: + return "Default charset not found"; + case XML_CHAR_ENCODING_ERROR: + return "Invalid or unsupported default charset"; + default: + return NULL; + } +} static const command_rec proxy_html_cmds[] = { - AP_INIT_TAKE23("ProxyHTMLURLMap", set_urlmap, NULL, - RSRC_CONF|ACCESS_CONF, "Map URL From To" ) , + AP_INIT_ITERATE("ProxyHTMLStartParse", set_skipto, NULL, + RSRC_CONF|ACCESS_CONF, + "Ignore anything in front of the first of 
these elements"), + AP_INIT_ITERATE("ProxyHTMLEvents", set_events, NULL, + RSRC_CONF|ACCESS_CONF, "Strings to be treated as scripting events"), + AP_INIT_ITERATE2("ProxyHTMLLinks", set_links, NULL, + RSRC_CONF|ACCESS_CONF, "Declare HTML Attributes"), + AP_INIT_RAW_ARGS("ProxyHTMLURLMap", set_urlmap, NULL, + RSRC_CONF|ACCESS_CONF, "Map URL From To" ) , AP_INIT_TAKE12("ProxyHTMLDoctype", set_doctype, NULL, - RSRC_CONF|ACCESS_CONF, "(HTML|XHTML) [Legacy]" ) , - AP_INIT_TAKE123("ProxyHTMLFixups", set_flags, NULL, - RSRC_CONF|ACCESS_CONF, "Options are lowercase, dospath" ) , + RSRC_CONF|ACCESS_CONF, "(HTML|XHTML) [Legacy]" ) , + AP_INIT_ITERATE("ProxyHTMLFixups", set_flags, NULL, + RSRC_CONF|ACCESS_CONF, "Options are lowercase, dospath" ) , AP_INIT_FLAG("ProxyHTMLMeta", ap_set_flag_slot, - (void*)APR_OFFSETOF(proxy_html_conf, metafix), - RSRC_CONF|ACCESS_CONF, "Fix META http-equiv elements" ) , + (void*)APR_OFFSETOF(proxy_html_conf, metafix), + RSRC_CONF|ACCESS_CONF, "Fix META http-equiv elements" ) , + AP_INIT_FLAG("ProxyHTMLInterp", ap_set_flag_slot, + (void*)APR_OFFSETOF(proxy_html_conf, interp), + RSRC_CONF|ACCESS_CONF, + "Support interpolation and conditions in URLMaps" ) , AP_INIT_FLAG("ProxyHTMLExtended", ap_set_flag_slot, - (void*)APR_OFFSETOF(proxy_html_conf, extfix), - RSRC_CONF|ACCESS_CONF, "Map URLs in Javascript and CSS" ) , + (void*)APR_OFFSETOF(proxy_html_conf, extfix), + RSRC_CONF|ACCESS_CONF, "Map URLs in Javascript and CSS" ) , AP_INIT_FLAG("ProxyHTMLStripComments", ap_set_flag_slot, - (void*)APR_OFFSETOF(proxy_html_conf, strip_comments), - RSRC_CONF|ACCESS_CONF, "Strip out comments" ) , + (void*)APR_OFFSETOF(proxy_html_conf, strip_comments), + RSRC_CONF|ACCESS_CONF, "Strip out comments" ) , #ifndef GO_FASTER AP_INIT_FLAG("ProxyHTMLLogVerbose", ap_set_flag_slot, - (void*)APR_OFFSETOF(proxy_html_conf, verbose), - RSRC_CONF|ACCESS_CONF, "Verbose Logging (use with LogLevel Info)" ) , + (void*)APR_OFFSETOF(proxy_html_conf, verbose), + 
+ RSRC_CONF|ACCESS_CONF, "Verbose Logging (use with LogLevel Info)" ) , #endif AP_INIT_TAKE1("ProxyHTMLBufSize", ap_set_int_slot, - (void*)APR_OFFSETOF(proxy_html_conf, bufsz), - RSRC_CONF|ACCESS_CONF, "Buffer size" ) , + (void*)APR_OFFSETOF(proxy_html_conf, bufsz), + RSRC_CONF|ACCESS_CONF, "Buffer size" ) , + AP_INIT_ITERATE2("ProxyHTMLCharsetAlias", set_charset_alias, NULL, + RSRC_CONF, "ProxyHTMLCharsetAlias charset alias [more aliases]" ) , + AP_INIT_TAKE1("ProxyHTMLCharsetDefault", set_charset_default, NULL, + RSRC_CONF|ACCESS_CONF, "Usage: ProxyHTMLCharsetDefault charset" ) , + AP_INIT_TAKE1("ProxyHTMLCharsetOut", ap_set_string_slot, + (void*)APR_OFFSETOF(proxy_html_conf, charset_out), + RSRC_CONF|ACCESS_CONF, "Usage: ProxyHTMLCharsetOut charset" ) , { NULL } } ; static int mod_proxy_html(apr_pool_t* p, apr_pool_t* p1, apr_pool_t* p2, - server_rec* s) { + server_rec* s) { ap_add_version_component(p, VERSION_STRING) ; + seek_meta_ctype = ap_pregcomp(p, + "(<meta[^>]*http-equiv[ \t\r\n='\"]*content-type[^>]*>)", + AP_REG_EXTENDED|AP_REG_ICASE) ; + seek_charset = ap_pregcomp(p, "charset=([A-Za-z0-9_-]+)", + AP_REG_EXTENDED|AP_REG_ICASE) ; + seek_meta = ap_pregcomp(p, "<meta[^>]*(http-equiv)[^>]*>", + AP_REG_EXTENDED|AP_REG_ICASE) ; + memset(&sax, 0, sizeof(htmlSAXHandler)); + sax.startElement = pstartElement ; + sax.endElement = pendElement ; + sax.characters = pcharacters ; + sax.comment = pcomment ; + sax.cdataBlock = pcdata ; return OK ; } static void proxy_html_hooks(apr_pool_t* p) { - ap_register_output_filter("proxy-html", proxy_html_filter, - NULL, AP_FTYPE_RESOURCE) ; + ap_register_output_filter_protocol("proxy-html", proxy_html_filter, + NULL, AP_FTYPE_RESOURCE, + AP_FILTER_PROTO_CHANGE|AP_FILTER_PROTO_CHANGE_LENGTH) ; ap_hook_post_config(mod_proxy_html, NULL, NULL, APR_HOOK_MIDDLE) ; - ap_hook_child_init(proxy_html_child_init, NULL, NULL, APR_HOOK_MIDDLE) ; } module AP_MODULE_DECLARE_DATA proxy_html_module = { - STANDARD20_MODULE_STUFF, - proxy_html_config, -
proxy_html_merge, - NULL, - NULL, - proxy_html_cmds, - proxy_html_hooks + STANDARD20_MODULE_STUFF, + proxy_html_config, + proxy_html_merge, + NULL, + NULL, + proxy_html_cmds, + proxy_html_hooks } ; - diff --git a/proxy_html.conf b/proxy_html.conf new file mode 100644 index 0000000..4e9367e --- /dev/null +++ b/proxy_html.conf @@ -0,0 +1,62 @@ +# Configuration example. +# +# First, to load the module with its prerequisites +# +# For Unix-family systems: +# LoadFile /usr/lib/libxml2.so +# LoadModule proxy_html_module modules/mod_proxy_html.so +# +# For Windows (I don't know if there's a standard path for the libraries) +# LoadFile C:/path/zlib.dll +# LoadFile C:/path/iconv.dll +# LoadFile C:/path/libxml2.dll +# LoadModule proxy_html_module modules/mod_proxy_html.so +# +# All knowledge of HTML links has been removed from the mod_proxy_html +# code itself, and is instead read from httpd.conf (or included file) +# at server startup. So you MUST declare it. This will normally be +# at top level, but can also be used in a <Location>. +# +# Here's the declaration for W3C HTML 4.01 and XHTML 1.0 + +ProxyHTMLLinks a href +ProxyHTMLLinks area href +ProxyHTMLLinks link href +ProxyHTMLLinks img src longdesc usemap +ProxyHTMLLinks object classid codebase data usemap +ProxyHTMLLinks q cite +ProxyHTMLLinks blockquote cite +ProxyHTMLLinks ins cite +ProxyHTMLLinks del cite +ProxyHTMLLinks form action +ProxyHTMLLinks input src usemap +ProxyHTMLLinks head profile +ProxyHTMLLinks base href +ProxyHTMLLinks script src for + +# To support scripting events (with ProxyHTMLExtended On), +# you'll need to declare them too. + +ProxyHTMLEvents onclick ondblclick onmousedown onmouseup \ + onmouseover onmousemove onmouseout onkeypress \ + onkeydown onkeyup onfocus onblur onload \ + onunload onsubmit onreset onselect onchange + +# If you need to support legacy (pre-1998, aka "transitional") HTML or XHTML, +# you'll need to uncomment the following deprecated link attributes.
+# Note that these are enabled in earlier mod_proxy_html versions +# +# ProxyHTMLLinks frame src longdesc +# ProxyHTMLLinks iframe src longdesc +# ProxyHTMLLinks body background +# ProxyHTMLLinks applet codebase +# +# If you're dealing with proprietary HTML variants, +# declare your own URL attributes here as required. +# +# ProxyHTMLLinks myelement myattr otherattr +# +# Also at top level in httpd.conf, you can declare charset aliases. +# This is the most efficient way to support encodings that libxml2 +# doesn't natively support. See the documentation at +# http://apache.webthing.com/mod_proxy_html/