Posted to dev@trafficserver.apache.org by Thomas Jackson <ja...@gmail.com> on 2014/03/25 21:50:07 UTC

Remap/regex_remap consolidation

Here at LinkedIn we've been using regular remap.config for a while (with
all our map options). One thing we've been looking into recently is path
based regexes (which regex_remap supports). While looking into it we found
a few shortcomings of the plugin-- and decided it would probably be better
for everyone involved if we could come up with a way to consolidate this
into regular remap. This raises a few questions about how to consolidate,
so we figure we'd solicit some feedback from the community before we get
started. Since there are quite a few changes we're considering I've tried
to assemble examples for all the scenarios, but to start out I'll put some
examples of how remaps look today.

*Standard remap.config example*

# ExampleA: match all domains coming in on a specific port with a path
regex_map_with_recv_port http://.*:8080/bar    http://dest.com:12345/bar
# ExampleB:  match all domains /foo regardless of port
regex_map http://.*/foo    http://dest.com:12345/foo
# ExampleC: match everything else on a specific domain
map http://foo.com/    http://127.0.0.1:12345/catchall

In regular remap.config some features are missing-- such as regex matching
based on the path. If I were to use regex_remap on a regular mapping rule
like so:

regex_map_with_recv_port http://.*:8080 http://dest.com/ @plugin=/usr/libexec/regex_remap.so @pparam=mapfile.map

I could then have a file (mapfile.map) which would look something like:

### map file contents
# regular remap
^/foo(.*)                     http://dest.com:12345/foo/$1
# strip the query string
^/foo(.*)(\?.*)?               http://dest.com:12345/foo/$1?$q
# a redirect
/oldpath(.*)                   http://newdomain.com:8080/newpath$1 @status=302


This regex_remap markup gives us a few nice things. First, it gives you regex
matching on the path, which we found to be extremely useful. Specifically
we have a use case for /foo and /foo/ to go to an app, but not /foobar. In
addition, regex_remap map files give you a cleaner markup for redirects
(@status=xxx), as well as per-remap-line config overrides. As we started
looking into it we realized that regex_remap (the plugin) is a bit limited
since you cannot use remap plugins within the map file. We then started
looking into adding that, but figured it might be less work (and more
helpful) to merge this into ATS proper. So when merging these features,
there are a few config questions to be answered.
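
To make the /foo use case concrete, here is a minimal sketch. It uses Python's
re module purely for illustration (ATS itself uses PCRE), and the pattern is
just one possible way to write such a rule, not something taken from an
existing config:

```python
import re

# Hypothetical pattern: match /foo and anything under /foo/, but not /foobar.
FOO_APP = re.compile(r"^/foo(/.*)?$")


def routes_to_foo_app(path):
    """Return True when the path is exactly /foo or lives under /foo/."""
    return FOO_APP.match(path) is not None


if __name__ == "__main__":
    for path in ("/foo", "/foo/", "/foo/bar", "/foobar"):
        print(path, routes_to_foo_app(path))
```

A plain prefix match on "/foo" cannot express this, since it would also catch
/foobar.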


*Question #1 how many regexes?*
Today there is one regex (in all the different regex_* types) which is only
on the domain name. In regex_remap there is one regex, but it matches the
path and the query string. So the question is how many different regexes
should there be?

To get some background of what these look like I'll implement a rule where
we match http/https on all domains, ports 8081 and 8082, and all paths.

The main ones we've thought of so far are:

    *1 regex*- so the entire string you are matching on is one big regex
        ^https?://.*:808[12]/(.*)$

        pros: fairly simple, dense (4 lines of today's configs can be
              merged into 1)
        cons: easy to mess up (lots to match at once). Fairly difficult to
              tell what it's doing

    *2 regex*- one for domain, and one for path
        http://.*:8081/(.*)$
        http://.*:8082/(.*)$
        https://.*:8081/(.*)$
        https://.*:8082/(.*)$

        pros: More closely matches what we do today
        cons: more verbose (can't regex the scheme or port)

    *4 regex*- one each for scheme, domain, port, and path
        ^https?$://^.*$:^808[12]$/^(.*)$

        pros: separation of the various regexes, dense, impossible to
              capture more than the field you are in
        cons: 4 regexes instead of 1 (might be more confusing?)

*Note*: any number of regexes >1 will require named capture groups (
http://www.regular-expressions.info/named.html), which means that you
cannot use $1, $2, etc. In a lot of ways this is nicer (more explicit) but
it is a change.
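
As a rough sketch of why named groups come up with more than one regex, the
Python below matches each URL field with its own pattern and merges the named
captures. The field names, the example.com host pattern, and the group names
(sub, rest) are all made up for illustration; none of this is existing ATS
behaviour:

```python
import re
from urllib.parse import urlsplit

# Hypothetical per-field patterns for the "4 regex" option. Each pattern only
# ever sees its own URL component.
FIELD_PATTERNS = {
    "scheme": re.compile(r"^https?$"),
    "host": re.compile(r"^(?P<sub>[^.]+)\.example\.com$"),
    "port": re.compile(r"^808[12]$"),
    "path": re.compile(r"^/(?P<rest>.*)$"),
}


def match_fields(url):
    """Return the merged named groups if every field matches, else None."""
    parts = urlsplit(url)
    fields = {
        "scheme": parts.scheme,
        "host": parts.hostname or "",
        "port": str(parts.port) if parts.port is not None else "",
        "path": parts.path,
    }
    captured = {}
    for name, pattern in FIELD_PATTERNS.items():
        m = pattern.match(fields[name])
        if m is None:
            return None
        captured.update(m.groupdict())
    return captured


if __name__ == "__main__":
    print(match_fields("https://www.example.com:8081/a/b"))
    print(match_fields("http://www.example.com:9000/a/b"))
```

With numbered groups, "$1" would be ambiguous here because both the host and
the path pattern have a first group; names avoid that.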

*What do I think?*
I prefer #2 or #4 as they help separate the regex matching into smaller
regexes (which should make finding non-matching regexes faster) as well as
make the regexes more scoped-- and hopefully harder to mess up (especially
by matching too much).


*Question #2 Explicit vs implicit regexes*
If we decide to have more than one regex (from #1), do we want all of them
to be implicitly handled as regex strings? Or do we want to rely on some
anchoring syntax to flag to the remap engine that the string is a regex
(such as requiring regexes to start with a '^' and end with a '$').

    Explicit:
        pros: clearer that the field is a regex, easier to optimize the
              remap engine
        cons: requires more markup, and would mean that if you just put in
              .* it would be a string match
    Implicit:
        pros: more closely matches the markup we have today, simpler configs
        cons: wasteful if most fields are strings

*What do I think?*
In general (and in this specific case) I like explicit over implicit since
it's clearer what you are doing. Especially if we pick 4 regexes (from
#1) this would allow you to effectively "disable" regex matching within
specific fields if you don't need it.
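
A minimal sketch of the explicit option, assuming the rule is "a field that
starts with ^ and ends with $ is a regex, anything else is a literal string".
The function name is made up and Python's re stands in for PCRE:

```python
import re


def compile_field(value):
    """Return a predicate for one config field.

    Assumed rule (not settled in this thread): a field wrapped in ^...$ is a
    regex; anything else is compared as a literal string.
    """
    if len(value) >= 2 and value.startswith("^") and value.endswith("$"):
        pattern = re.compile(value)
        return lambda text: pattern.match(text) is not None
    # No anchors: skip the regex engine entirely and do a string compare.
    return lambda text: text == value


if __name__ == "__main__":
    print(compile_field("^.*$")("foo.com"))  # anchored, treated as a regex
    print(compile_field(".*")("foo.com"))    # unanchored, treated as a literal
```

This also shows the con listed above: an unanchored ".*" is compared as the
two-character string ".*" rather than as a wildcard.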


*Question #3 How to handle query strings in the match?*
Today the query string is not part of remap.config. In regex_remap it is
optional based on a @pparam=no-query-string. The advantage of not having it
in the path is you don't have to worry about matching it accidentally or
reconstructing it. The downside is that you can never match on it. This
could be controlled by some @ parameter, but that could make remap.config a
bit confusing since it wouldn't be consistent.

*What do I think?*
I would like to leave the query parameters out, or at least have some @
parameter to disable them on a per-line basis (since I don't want to have
rules matching on query params).


*Question #4 How do you *drop* the query string?*
If for #3 we do decide to put the query string in the match, how would we
drop it on a specific path? In the regex_remap plugin I'd simply create a
rule such as:

# remove query params in regex_remap
^/(.*)(\?.*)?  /$1

And by not adding ?$2 (or something similar) I'd be removing the query
parameters. Conversely, if I want to keep the query parameters that means I
effectively have to re-add them on every remap line like so:

# keep the query params
^/(.*)(\?.*)?  /$1/$2
# or with the shorthand
^/(.*)(\?.*)?  /$1/$q

*What do I think?*
No real opinion on this one, mostly because I don't really want the query
params in the matching string to begin with :)
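
A side note for anyone experimenting with patterns like the ones in this
question: a greedy (.*) can swallow the query string before the optional
(\?.*)? group gets a chance to match, so the second group ends up empty. The
sketch below shows this with Python's re module (used here only for
illustration); the [^?]* variant is a suggested alternative, not something
from the plugin documentation:

```python
import re

# Pattern as written in the examples above: the first group is greedy.
GREEDY = re.compile(r"^/(.*)(\?.*)?")
# Suggested alternative (an assumption): stop the first group at the first "?".
SCOPED = re.compile(r"^/([^?]*)(\?.*)?")


def split_request(pattern, target):
    """Return (path_part, query_part) as captured by the two groups."""
    m = pattern.match(target)
    if m is None:
        return None
    return (m.group(1), m.group(2) or "")


if __name__ == "__main__":
    print(split_request(GREEDY, "/foo?a=1"))
    print(split_request(SCOPED, "/foo?a=1"))
```

With the scoped pattern the path and the query land in separate groups, so
dropping or keeping the query becomes a matter of which group you reference.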


*Question #5 How to remap paths that are regexed*
Today we have remap lines that look something like:
regex_map http://.*/foo http://dest.com:12345/

What this means is that any request with path starting with "/foo" will be
remapped and the "/foo" will be replaced with "/". Once we allow regexes in
the path we have more variables to take into account. If we take this same
simple case with regex_remap it would look something like:
/foo(.*)  http://dest.com:12345/$1

If we were to mimic this markup in remap.config it could look something
like:
regex_map http://.*/foo(.*) http://dest.com:12345/$1

This is a bit more explicit in what it is doing, but in this simple case it
means quite a few more characters to get the same meaning across. What this
gets you is the flexibility to do more complex regex matches if needed. If
we do go with markup like this and we pick anything more than 1 regex from
#1 we'd be forced to use named groups instead of $1, $2, etc. since the
strings matched would come from different regexes.


*What do I think?*
If we used 4 regexes (from #1) and explicit (from #2) we could keep the
markup pretty similar to what we have now and still have the ability to do
some cool things.

So, for the base case of regex_map today (regex domain only) it would look
like:
regex_map http://^.*$/foo http://dest.com:12345/foo

This would only regex the domain name (since it starts with ^ and ends with
$) and then the path would be treated like a regular map (same as it is
today). If I wanted to do some regexing based on the path I could write
something like:

regex_map http://^.*$^/foo(?P<num>\d+)$ http://dest.com:12345/$num/foo
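
To illustrate how a named group from the path regex could feed the
destination, here is a small Python sketch. It only models the path part of
the rule above; string.Template is used as a stand-in for whatever expansion
syntax remap.config would end up with, which is an assumption:

```python
import re
from string import Template

# Path regex and destination taken from the rule above; the host regex is
# left out to keep the sketch short.
PATH_RE = re.compile(r"^/foo(?P<num>\d+)$")
DEST = Template("http://dest.com:12345/$num/foo")


def remap_path(path):
    """Expand $num in the destination from the named group, or return None."""
    m = PATH_RE.match(path)
    if m is None:
        return None
    return DEST.substitute(m.groupdict())


if __name__ == "__main__":
    print(remap_path("/foo42"))
    print(remap_path("/foobar"))
```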




If you have any opinions/feedback about these specifics please let me
know-- we're hoping to nail down the markup fairly quickly and get this
taken care of in the next few weeks. If you have questions about what
markup would look like (and don't want to spam the mailing list) feel free
to mail me individually or PM me on IRC (jacksontj).


Thomas Jackson
Traffic SRE @ LinkedIn

Re: Remap/regex_remap consolidation

Posted by Leif Hedstrom <zw...@apache.org>.


> On Mar 26, 2014, at 12:31 PM, Thomas Jackson <ja...@gmail.com> wrote:
> 
> The biggest performance gain I see of separate regexes is that I can
> execute the unique domain regexes which should resolve to a list of path
> regexes. This should be pretty big performance wise as a lot of rules in
> large remaps have the same domain which may or may not be a regex.

Agreed. That also retains some of the old regex_remap performance characteristics.

-- Leif 

>> On Mar 26, 2014 6:20 AM, "Leif Hedstrom" <zw...@apache.org> wrote:
>> 
>> 
>> 
>>> On Mar 26, 2014, at 2:02 AM, Brian Geffon <br...@apache.org> wrote:
>>> 
>>> Thomas, I somewhat agree: my guess would be the additional regexes would
>>> likely cancel any performance gain there.
>>> 
>>> Does anyone else have feedback or comments?
>> 
>> 
>> The other argument for this is that with separate regexes, you don't have
>> to create the full URL string representation. I don't know if the core has
>> any optimizations here, but for a plugin that is an expensive operation.
>> With separate regexes for host and path, this is a non-issue.
>> 
>> Maybe you can do separate regexes but let expansions cross over? $h[1] for
>> a host group etc.
>> 
>> Pros and cons :)
>> 
>> -- Leif
>> 
>>> 
>>> Brian
>>> 
>>>> On Tuesday, March 25, 2014, Thomas Jackson <ja...@gmail.com> wrote:
>>>> 
>>>> Another consideration for having >1 regex (which may or may not be
>>>> premature optimization) is that if you have separate regexes we can create
>>>> hash maps similar to how maps work (a hashmap of domain_regex -> list of
>>>> path regexes) which would make overall remap performance faster/better.
>>>> 
>>>> 
>>>> On Tue, Mar 25, 2014 at 9:50 PM, Brian Geffon <briang@apache.org> wrote:
>>>> 
>>>>> Right.
>>>>> 
>>>>> 
>>>>> On Tuesday, March 25, 2014, Leif Hedstrom <zwoop@apache.org> wrote:
>>>>> 
>>>>>> 
>>>>>>> On Mar 25, 2014, at 7:51 PM, Brian Geffon <br...@gmail.com> wrote:
>>>>>>> 
>>>>>>> What Thomas called Question #1 -- 1 Regex.
>>>>>> 
>>>>>> Makes sense to have them combined. Assuming groups etc. works, that
>>>>>> allows you to do e.g.
>>>>>> 
>>>>>>  regex_map http://(.*)\.ogre\.com/([^/]+)/(.*)  http://$2/$1/$3
>>>>>> 
>>>>>> 
>>>>>> or some such. i.e. take parts from the path match and use as the host,
>>>>>> and vice versa. Right? :)
>>>>>> 
>>>>>> Cheers,
>>>>>> 
>>>>>> -- Leif
>>

Re: Remap/regex_remap consolidation

Posted by Thomas Jackson <ja...@gmail.com>.

The biggest performance gain I see from separate regexes is that I can
execute the unique domain regexes, each of which should resolve to a list of
path regexes. This should be pretty big performance-wise, as a lot of rules in
large remaps have the same domain, which may or may not be a regex.

On Mar 26, 2014 6:20 AM, "Leif Hedstrom" <zw...@apache.org> wrote:

>
>
> > On Mar 26, 2014, at 2:02 AM, Brian Geffon <br...@apache.org> wrote:
> >
> > Thomas, I somewhat agree: my guess would be the additional regexes would
> > likely cancel any performance gain there.
> >
> > Does anyone else have feedback or comments?
>
>
> The other argument for this is that with separate regexes, you don't have
> to create the full URL string representation. I don't know if the core has
> any optimizations here, but for a plugin that is an expensive operation.
> With separate regexes for host and path, this is a non-issue.
>
> Maybe you can do separate regexes but let expansions cross over? $h[1] for
> a host group etc.
>
> Pros and cons :)
>
> -- Leif
>
> >
> > Brian
> >
> >> On Tuesday, March 25, 2014, Thomas Jackson <ja...@gmail.com> wrote:
> >>
> >> Another consideration for having >1 regex (which may or may not be
> >> premature optimization) is that if you have separate regexes we can create
> >> hash maps similar to how maps work (a hashmap of domain_regex -> list of
> >> path regexes) which would make overall remap performance faster/better.
> >>
> >>
> >> On Tue, Mar 25, 2014 at 9:50 PM, Brian Geffon <briang@apache.org> wrote:
> >>
> >>> Right.
> >>>
> >>>
> >>> On Tuesday, March 25, 2014, Leif Hedstrom <zwoop@apache.org> wrote:
> >>>
> >>>>
> >>>>> On Mar 25, 2014, at 7:51 PM, Brian Geffon <br...@gmail.com> wrote:
> >>>>>
> >>>>> What Thomas called Question #1 -- 1 Regex.
> >>>>
> >>>> Makes sense to have them combined. Assuming groups etc. works, that
> >>>> allows you to do e.g.
> >>>>
> >>>>   regex_map http://(.*)\.ogre\.com/([^/]+)/(.*)  http://$2/$1/$3
> >>>>
> >>>>
> >>>> or some such. i.e. take parts from the path match and use as the host,
> >>>> and vice versa. Right? :)
> >>>>
> >>>> Cheers,
> >>>>
> >>>> -- Leif
> >>
>

Re: Remap/regex_remap consolidation

Posted by Leif Hedstrom <zw...@apache.org>.


> On Mar 26, 2014, at 2:02 AM, Brian Geffon <br...@apache.org> wrote:
> 
> Thomas, I somewhat agree: my guess would be the additional regexes would
> likely cancel any performance gain there.
> 
> Does anyone else have feedback or comments?


The other argument for this is that with separate regexes, you don't have to create the full URL string representation. I don't know if the core has any optimizations here, but for a plugin that is an expensive operation. With separate regexes for host and path, this is a non-issue.

Maybe you can do separate regexes but let expansions cross over? $h[1] for a host group etc.
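
A rough sketch of what cross-over expansions could look like, in Python. Only
$h[1] is mentioned above; the $p[N] form for path groups, the rule itself, and
all function names are assumptions made for illustration:

```python
import re

# Hypothetical expansion syntax: $h[N] for host groups, $p[N] for path groups.
TOKEN = re.compile(r"\$(?P<field>[hp])\[(?P<index>\d+)\]")


def expand(template, host_match, path_match):
    """Replace $h[N] / $p[N] tokens with groups from the respective match."""
    matches = {"h": host_match, "p": path_match}

    def repl(tok):
        return matches[tok.group("field")].group(int(tok.group("index")))

    return TOKEN.sub(repl, template)


def remap(host, path):
    """Match host and path with separate regexes, then expand across both."""
    host_match = re.match(r"^(.*)\.ogre\.com$", host)
    path_match = re.match(r"^/([^/]+)/(.*)$", path)
    if host_match is None or path_match is None:
        return None
    return expand("http://$p[1]/$h[1]/$p[2]", host_match, path_match)


if __name__ == "__main__":
    print(remap("www.ogre.com", "/example.org/a/b"))
```

Neither regex sees the full URL string here, yet the destination can still mix
host and path captures.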

Pros and cons :)

-- Leif 

> 
> Brian
> 
>> On Tuesday, March 25, 2014, Thomas Jackson <ja...@gmail.com> wrote:
>> 
>> Another consideration for having >1 regex (which may or may not be
>> premature optimization) is that if you have separate regexes we can create
>> hash maps similar to how maps work (a hashmap of domain_regex -> list of
>> path regexes) which would make overall remap performance faster/better.
>> 
>> 
>> On Tue, Mar 25, 2014 at 9:50 PM, Brian Geffon <briang@apache.org> wrote:
>> 
>>> Right.
>>> 
>>> 
>>> On Tuesday, March 25, 2014, Leif Hedstrom <zwoop@apache.org> wrote:
>>> 
>>>> 
>>>>> On Mar 25, 2014, at 7:51 PM, Brian Geffon <br...@gmail.com> wrote:
>>>>> 
>>>>> What Thomas called Question #1 -- 1 Regex.
>>>> 
>>>> Makes sense to have them combined. Assuming groups etc. works, that
>>>> allows you to do e.g.
>>>> 
>>>>   regex_map http://(.*)\.ogre\.com/([^/]+)/(.*)  http://$2/$1/$3
>>>> 
>>>> 
>>>> or some such. i.e. take parts from the path match and use as the host,
>>>> and vice versa. Right? :)
>>>> 
>>>> Cheers,
>>>> 
>>>> -- Leif
>>

Re: Remap/regex_remap consolidation

Posted by Brian Geffon <br...@apache.org>.

Thomas, I somewhat agree: my guess would be the additional regexes would
likely cancel any performance gain there.

Does anyone else have feedback or comments?

Brian

On Tuesday, March 25, 2014, Thomas Jackson <ja...@gmail.com> wrote:

> Another consideration for having >1 regex (which may or may not be
> premature optimization) is that if you have separate regexes we can create
> hash maps similar to how maps work (a hashmap of domain_regex -> list of
> path regexes) which would make overall remap performance faster/better.
>
>
> On Tue, Mar 25, 2014 at 9:50 PM, Brian Geffon <briang@apache.org> wrote:
>
>> Right.
>>
>>
>> On Tuesday, March 25, 2014, Leif Hedstrom <zwoop@apache.org> wrote:
>>
>>>
>>> On Mar 25, 2014, at 7:51 PM, Brian Geffon <br...@gmail.com> wrote:
>>>
>>> > What Thomas called Question #1 -- 1 Regex.
>>> >
>>>
>>> Makes sense to have them combined. Assuming groups etc. works, that
>>> allows you to do e.g.
>>>
>>>    regex_map http://(.*)\.ogre\.com/([^/]+)/(.*)  http://$2/$1/$3
>>>
>>>
>>> or some such. i.e. take parts from the path match and use as the host,
>>> and vice versa. Right? :)
>>>
>>> Cheers,
>>>
>>> -- Leif
>>>
>>>
>

Re: Remap/regex_remap consolidation

Posted by Thomas Jackson <ja...@gmail.com>.

Another consideration for having >1 regex (which may or may not be
premature optimization) is that if you have separate regexes we can create
hash maps similar to how maps work (a hashmap of domain_regex -> list of
path regexes) which would make overall remap performance faster/better.
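
Roughly what I have in mind, as a Python sketch. The rule table, hostnames,
and destinations are invented for the example, and a real implementation
would live in the remap engine rather than in a dict:

```python
import re

# Hypothetical rule table: each distinct domain regex is stored once and
# points at the list of path rules that share it.
RULES = {
    r"^.*\.example\.com$": [
        (r"^/foo(/.*)?$", "http://dest.com:12345/foo"),
        (r"^/bar(/.*)?$", "http://dest.com:12345/bar"),
    ],
    r"^static\..*$": [
        (r"^/.*$", "http://127.0.0.1:12345/catchall"),
    ],
}

# Compile everything once at config load time.
COMPILED = [
    (re.compile(domain), [(re.compile(p), dest) for p, dest in paths])
    for domain, paths in RULES.items()
]


def lookup(host, path):
    """Run each domain regex once; only try path regexes under a matching domain."""
    for domain_re, path_rules in COMPILED:
        if domain_re.match(host) is None:
            continue
        for path_re, dest in path_rules:
            if path_re.match(path):
                return dest
    return None


if __name__ == "__main__":
    print(lookup("www.example.com", "/bar/x"))
    print(lookup("static.foo.org", "/img.png"))
    print(lookup("other.org", "/foo"))
```

A request whose host matches no domain regex never touches any path regex,
which is where the savings would come from when many rules share a domain.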

On Tue, Mar 25, 2014 at 9:50 PM, Brian Geffon <br...@apache.org> wrote:

> Right.
>
>
> On Tuesday, March 25, 2014, Leif Hedstrom <zw...@apache.org> wrote:
>
>>
>> On Mar 25, 2014, at 7:51 PM, Brian Geffon <br...@gmail.com> wrote:
>>
>> > What Thomas called Question #1 -- 1 Regex.
>> >
>>
>> Makes sense to have them combined. Assuming groups etc. works, that
>> allows you to do e.g.
>>
>>    regex_map http://(.*)\.ogre\.com/([^/]+)/(.*)  http://$2/$1/$3
>>
>>
>> or some such. i.e. take parts from the path match and use as the host,
>> and vice versa. Right? :)
>>
>> Cheers,
>>
>> -- Leif
>>
>>

Re: Remap/regex_remap consolidation

Posted by Brian Geffon <br...@apache.org>.

Right.

On Tuesday, March 25, 2014, Leif Hedstrom <zw...@apache.org> wrote:

>
> On Mar 25, 2014, at 7:51 PM, Brian Geffon <briangeffon@gmail.com> wrote:
>
> > What Thomas called Question #1 -- 1 Regex.
> >
>
> Makes sense to have them combined. Assuming groups etc. works, that allows
> you to do e.g.
>
>    regex_map http://(.*)\.ogre\.com/([^/]+)/(.*)  http://$2/$1/$3
>
>
> or some such. i.e. take parts from the path match and use as the host, and
> vice versa. Right? :)
>
> Cheers,
>
> -- Leif
>
>

Re: Remap/regex_remap consolidation

Posted by Leif Hedstrom <zw...@apache.org>.

On Mar 25, 2014, at 7:51 PM, Brian Geffon <br...@gmail.com> wrote:

> What Thomas called Question #1 -- 1 Regex.
> 

Makes sense to have them combined. Assuming groups etc. works, that allows you to do e.g.

   regex_map http://(.*)\.ogre\.com/([^/]+)/(.*)  http://$2/$1/$3

or some such. i.e. take parts from the path match and use as the host, and vice versa. Right? :)
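
For illustration, the rule above behaves like this when modelled with Python's
re module. The sample URL is made up, and Python's \1-style references stand
in for $1-style expansions:

```python
import re

# Single combined regex over the whole URL, as in the regex_map line above.
RULE = re.compile(r"^http://(.*)\.ogre\.com/([^/]+)/(.*)$")


def remap(url):
    """Swap the host prefix and the first path segment, or return None."""
    m = RULE.match(url)
    if m is None:
        return None
    # Equivalent of the destination http://$2/$1/$3.
    return m.expand(r"http://\2/\1/\3")


if __name__ == "__main__":
    print(remap("http://www.ogre.com/example.org/a/b"))
```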

Cheers,

— Leif

Re: Remap/regex_remap consolidation

Posted by Brian Geffon <br...@gmail.com>.

What Thomas called Question #1 -- 1 Regex.

Brian

On Tuesday, March 25, 2014, Leif Hedstrom <zw...@apache.org> wrote:

>
> On Mar 25, 2014, at 7:42 PM, Brian Geffon <briang@apache.org> wrote:
>
> > I personally vote for regexing everything until we get around to finalizing
> > lua config, I also vote for this being the one who will end up writing the
> > code :)
>
>
>
> Which option is that?
>
> -- Leif
>
>

Re: Remap/regex_remap consolidation

Posted by Leif Hedstrom <zw...@apache.org>.

On Mar 25, 2014, at 7:42 PM, Brian Geffon <br...@apache.org> wrote:

> I personally vote for regexing everything until we get around to finalizing
> lua config, I also vote for this being the one who will end up writing the
> code :)



Which option is that?

— Leif

Re: Remap/regex_remap consolidation

Posted by Brian Geffon <br...@apache.org>.

I personally vote for regexing everything until we get around to finalizing
lua config, I also vote for this being the one who will end up writing the
code :)

Brian

On Tuesday, March 25, 2014, James Peach <jp...@apache.org> wrote:

> On Mar 25, 2014, at 3:28 PM, Leif Hedstrom <zwoop@apache.org> wrote:
>
> >
> > On Mar 25, 2014, at 4:08 PM, James Peach <jpeach@apache.org> wrote:
> >
> >> On Mar 25, 2014, at 1:50 PM, Thomas Jackson <jacksontj.89@gmail.com> wrote:
> >>
> >>> Here at LinkedIn we've been using regular remap.config for a while (with
> >>> all our map options). One thing we've been looking into recently is path
> >>> based regexes (which regex_remap supports). While looking into it we found
> >>> a few shortcomings of the plugin-- and decided it would probably be better
> >>> for everyone involved if we could come up with a way to consolidate this
> >>> into regular remap. This raises a few questions about how to consolidate,
> >>> so we figure we'd solicit some feedback from the community before we get
> >>> started. Since there are quite a few changes we're considering I've tried
> >>> to assemble examples for all the scenarios, but to start out I'll put some
> >>> examples of how remaps look today.
> >>
> >> Hi Thomas,
> >>
> >> I think there is a real need for something like this. I think that it
> is pretty common to follow a regex_map with a secondary set of regex_remap
> rules. One observation that I have is that the current system conflates
> matching and rewriting. I think that the is why you end up with a large set
> of alternative regex-based syntaxes below. Have you considered a system of
> simple, composable match and rewrite operators?
> >>
> >> A made-up example, to match https://*.example.com and rewrite it ...
> >>
> >> Old version:
> >>      regex_map https://.*.example.com/foo http://dest.example.com
> >>
> >> New version:
> >>      map @scheme @value=https @replace=http \
> >>              @host @match=*.example.com @replace=dest.example.com \
> >>              @path @match=/foo(.*) @replace=$0
> >>
> >> The new version is verbose, but extensible and more flexible than the
> fixed syntax. I haven't really thought this through very much ...
> >
> >
> > I agree that regex_map is not great. However, the above seems very
> involved, and something I believe ought to be part of the new Lua config
> format. I.e. lets focus on the Lua configuration stuff, instead of baking
> more crud into the current remap.config mess?
>
> Irrespective of ugly syntax, I think the fundamental issue here is the
> need to vary matching and URL rewriting independently. You could do it by
> having Lua blocks, you could do it by baking more crud into remap.config,
> or you could do it by regexing everything :)
>
> J

Re: Remap/regex_remap consolidation

Posted by James Peach <jp...@apache.org>.

On Mar 25, 2014, at 3:28 PM, Leif Hedstrom <zw...@apache.org> wrote:

> 
> On Mar 25, 2014, at 4:08 PM, James Peach <jp...@apache.org> wrote:
> 
>> On Mar 25, 2014, at 1:50 PM, Thomas Jackson <ja...@gmail.com> wrote:
>> 
>>> Here at LinkedIn we've been using regular remap.config for a while (with
>>> all our map options). One thing we've been looking into recently is path
>>> based regexes (which regex_remap supports). While looking into it we found
>>> a few shortcomings of the plugin-- and decided it would probably be better
>>> for everyone involved if we could come up with a way to consolidate this
>>> into regular remap. This raises a few questions about how to consolidate,
>>> so we figure we'd solicit some feedback from the community before we get
>>> started. Since there are quite a few changes we're considering I've tried
>>> to assemble examples for all the scenarios, but to start out I'll put some
>>> examples of how remaps look today.
>> 
>> Hi Thomas,
>> 
>> I think there is a real need for something like this. I think that it is pretty common to follow a regex_map with a secondary set of regex_remap rules. One observation that I have is that the current system conflates matching and rewriting. I think that this is why you end up with a large set of alternative regex-based syntaxes below. Have you considered a system of simple, composable match and rewrite operators?
>> 
>> A made-up example, to match https://*.example.com and rewrite it ...
>> 
>> Old version:
>> 	regex_map https://.*.example.com/foo http://dest.example.com
>> 
>> New version:
>> 	map @scheme @value=https @replace=http \
>> 		@host @match=*.example.com @replace=dest.example.com \
>> 		@path @match=/foo(.*) @replace=$0
>> 
>> The new version is verbose, but extensible and more flexible than the fixed syntax. I haven't really thought this through very much …
> 
> 
> I agree that regex_map is not great. However, the above seems very involved, and something I believe ought to be part of the new Lua config format. I.e. let's focus on the Lua configuration stuff, instead of baking more crud into the current remap.config mess?

Irrespective of ugly syntax, I think the fundamental issue here is the need to vary matching and URL rewriting independently. You could do it by having Lua blocks, you could do it by baking more crud into remap.config, or you could do it by regexing everything :)

J

Re: Remap/regex_remap consolidation

Posted by Leif Hedstrom <zw...@apache.org>.

On Mar 25, 2014, at 4:08 PM, James Peach <jp...@apache.org> wrote:

> On Mar 25, 2014, at 1:50 PM, Thomas Jackson <ja...@gmail.com> wrote:
> 
>> Here at LinkedIn we've been using regular remap.config for a while (with
>> all our map options). One thing we've been looking into recently is path
>> based regexes (which regex_remap supports). While looking into it we found
>> a few shortcomings of the plugin-- and decided it would probably be better
>> for everyone involved if we could come up with a way to consolidate this
>> into regular remap. This raises a few questions about how to consolidate,
>> so we figure we'd solicit some feedback from the community before we get
>> started. Since there are quite a few changes we're considering I've tried
>> to assemble examples for all the scenarios, but to start out I'll put some
>> examples of how remaps look today.
> 
> Hi Thomas,
> 
> I think there is a real need for something like this. I think that it is pretty common to follow a regex_map with a secondary set of regex_remap rules. One observation that I have is that the current system conflates matching and rewriting. I think that this is why you end up with a large set of alternative regex-based syntaxes below. Have you considered a system of simple, composable match and rewrite operators?
> 
> A made-up example, to match https://*.example.com and rewrite it ...
> 
> Old version:
> 	regex_map https://.*.example.com/foo http://dest.example.com
> 
> New version:
> 	map @scheme @value=https @replace=http \
> 		@host @match=*.example.com @replace=dest.example.com \
> 		@path @match=/foo(.*) @replace=$0
> 
> The new version is verbose, but extensible and more flexible than the fixed syntax. I haven't really thought this through very much …


I agree that regex_map is not great. However, the above seems very involved, and something I believe ought to be part of the new Lua config format. I.e. let's focus on the Lua configuration stuff, instead of baking more crud into the current remap.config mess?

Just my $.01,

— Leif

Re: Remap/regex_remap consolidation

Posted by James Peach <jp...@apache.org>.

On Mar 25, 2014, at 1:50 PM, Thomas Jackson <ja...@gmail.com> wrote:

> Here at LinkedIn we've been using regular remap.config for a while (with
> all our map options). One thing we've been looking into recently is path
> based regexes (which regex_remap supports). While looking into it we found
> a few shortcomings of the plugin-- and decided it would probably be better
> for everyone involved if we could come up with a way to consolidate this
> into regular remap. This raises a few questions about how to consolidate,
> so we figure we'd solicit some feedback from the community before we get
> started. Since there are quite a few changes we're considering I've tried
> to assemble examples for all the scenarios, but to start out I'll put some
> examples of how remaps look today.

Hi Thomas,

I think there is a real need for something like this. I think that it is pretty common to follow a regex_map with a secondary set of regex_remap rules. One observation that I have is that the current system conflates matching and rewriting. I think that this is why you end up with a large set of alternative regex-based syntaxes below. Have you considered a system of simple, composable match and rewrite operators?

A made-up example, to match https://*.example.com and rewrite it ...

Old version:
	regex_map https://.*.example.com/foo http://dest.example.com

New version:
	map @scheme @value=https @replace=http \
		@host @match=*.example.com @replace=dest.example.com \
		@path @match=/foo(.*) @replace=$0

The new version is verbose, but extensible and more flexible than the fixed syntax. I haven't really thought this through very much ...
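
To make the composable idea a bit more concrete, here is a minimal Python sketch of per-component match and rewrite operators. It is only an illustration, not ATS code: the helper names (exact, glob, regex, remap) are made up, and the path rewrite uses the first capture group, which is one reading of the made-up example above rather than a statement of what `$0` was meant to do.

```python
import fnmatch
import re
from urllib.parse import urlsplit, urlunsplit

# Each operator handles one URL component. It returns the rewritten value,
# or None when the component does not match. All names are hypothetical.
def exact(value, replace):
    return lambda s: replace if s == value else None

def glob(pattern, replace):
    return lambda s: replace if fnmatch.fnmatchcase(s, pattern) else None

def regex(pattern, replace):
    compiled = re.compile(pattern)
    def op(s):
        m = compiled.fullmatch(s)
        return m.expand(replace) if m else None
    return op

def remap(url, scheme_op, host_op, path_op):
    parts = urlsplit(url)
    scheme = scheme_op(parts.scheme)
    host = host_op(parts.hostname or "")
    path = path_op(parts.path)
    if None in (scheme, host, path):
        return None  # at least one component failed to match
    # The query string is passed through untouched in this sketch.
    return urlunsplit((scheme, host, path, parts.query, ""))

# Rough equivalent of the "new version" rule above.
rule = (
    exact("https", "http"),
    glob("*.example.com", "dest.example.com"),
    regex(r"/foo(.*)", r"\1"),
)

print(remap("https://www.example.com/foo/bar?x=1", *rule))
```

Because matching and rewriting are separate per component, swapping the host operator for a regex (or an exact match) does not touch the scheme or path handling.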

> 
> *Standard remap.config example*
> 
> # ExampleA: match all domains coming in on a specific port with a path
> regex_map_with_recv_port http://.*:8080/bar    http://dest.com:12345/bar
> # ExampleB:  match all domains /foo regardless of port
> regex_map http://.*/foo    http://dest.com:12345/foo
> # ExampleC: match everything else on a specific domain
> map http://foo.com/    http://127.0.0.1:12345/catchall
> 
> In regular remap.config some features are missing-- such as regex matching
> based on the path. If I were to use regex_remap on a regular mapping rule
> like so:
> 
> regex_map_with_recv_port http://.*:8080 http://dest.com/ @plugin=/usr/libexec/regex_remap.so @pparam=mapfile.map
> 
> I could then have a file (mapfile.map) which would look something like:
> 
> ### map file contents
> # regular remap
> ^/foo(.*)                     http://dest.com:12345/foo/$1
> # strip the query string
> ^/foo(.*)(\?.*)?               http://dest.com:12345/foo/$1?$q
> # a redirect
> /oldpath(.*)                   http://newdomain.com:8080/newpath$1 @status=302
> 
> 
> This regex_remap markup gives us a few nice things. This gives you regex
> matching on the path, which we found to be extremely useful. Specifically
> we have a use case for /foo and /foo/ to go to an app, but not /foobar. In
> addition regex_remap map files give you a cleaner markup for redirects
> (@status=xxx), as well as per-remap-line config overrides. As we started
> looking into it we realized that regex_remap (the plugin) is a bit limited
> since you cannot use remap plugins within the map file. We then started
> looking into adding that, but figured it might be less work (and more
> helpful) to merge this into ATS proper. So when merging these features,
> there are a few config questions to be answered.
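
The /foo, /foo/ but not /foobar case mentioned above can be sketched in Python. This is only an illustration of why a path regex helps where a plain prefix match does not; the pattern and function names are assumptions, not anything taken from ATS or regex_remap.

```python
import re

# Hypothetical path regex: "/foo", optionally followed by "/" and anything.
FOO_APP = re.compile(r"^/foo(/.*)?$")

def goes_to_app(path):
    """True when the path belongs to the /foo app under the regex rule."""
    return FOO_APP.match(path) is not None

def prefix_match(path):
    """What a plain prefix comparison would decide for the same path."""
    return path.startswith("/foo")

for path in ("/foo", "/foo/", "/foo/bar", "/foobar"):
    print(path, goes_to_app(path), prefix_match(path))
```

The two functions only disagree on /foobar, which is exactly the request that should not reach the app.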
> 
> 
> *Question #1 how many regexes?*
> Today there is one regex (in all the different regex_* types) which is only
> on the domain name. In regex_remap there is one regex, but it matches the
> path and the query string. So the question is how many different regexes
> should there be?
> 
> To get some background of what these look like I'll implement a rule where
> we match http/https on all domains, ports 8081 and 8082, and all paths.
> 
> The main ones we've thought of so far are:
> 
>    *1 regex*- so the entire string you are matching on is one big regex
>        ^https?://.*:808[12]/(.*)$
> 
>        pros: fairly simple, dense (4 lines of today's configs can be merged into 1)
>        cons: easy to mess up (lots to match at once). Fairly difficult to tell what it's doing
> 
>    *2 regex*- one for domain, and one for path
>        http://.*:8081/(.*)$
>        http://.*:8082/(.*)$
>        https://.*:8081/(.*)$
>        https://.*:8082/(.*)$
> 
>        pros: More closely matches what we do today
>        cons: more verbose (can't regex the scheme or port)
> 
>    *4 regex*- one for scheme, domain, port, and path
>        ^https?$://^.*$:^808[12]$/^(.*)$
> 
>        pros: separation of the various regexes, dense, impossible to capture more than the field you are in
>        cons: 4 regexes instead of 1 (might be more confusing?)
> 
> *Note*: any number of regexes >1 will require named capture groups
> (http://www.regular-expressions.info/named.html), which means that you
> cannot use $1, $2, etc. In a lot of ways this is nicer (more explicit) but
> it is a change.
> 
> *What do I think?*
> I prefer #2 or #4 as they help separate the regex matching into smaller
> regexes (which should make finding non-matching regexes faster) as well as
> make the regexes more scoped-- and hopefully harder to mess up (especially
> by matching too much).
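
The "matching too much" concern can be shown with a small Python comparison of the single-regex and four-regex options. This is a sketch under assumptions: the URL is split with urllib rather than ATS's own parser, the path regex is written as `^/(.*)$` against the full path, and the function names are invented for the example.

```python
import re
from urllib.parse import urlsplit

# Option "1 regex": one pattern over the whole URL string.
ONE_REGEX = re.compile(r"^https?://.*:808[12]/(.*)$")

# Option "4 regex": one pattern per field (hypothetical layout).
FIELD_REGEXES = {
    "scheme": re.compile(r"^https?$"),
    "host": re.compile(r"^.*$"),
    "port": re.compile(r"^808[12]$"),
    "path": re.compile(r"^/(.*)$"),
}

def match_one(url):
    return ONE_REGEX.match(url) is not None

def match_four(url):
    parts = urlsplit(url)
    fields = {
        "scheme": parts.scheme,
        "host": parts.hostname or "",
        "port": str(parts.port) if parts.port is not None else "",
        "path": parts.path,
    }
    # Every field must match its own regex; no regex can see another field.
    return all(FIELD_REGEXES[name].match(value) for name, value in fields.items())

good = "https://foo.com:8082/bar"
tricky = "http://example.com/redirect?to=http://x:8081/"
print(match_one(good), match_four(good))
print(match_one(tricky), match_four(tricky))
```

The second URL has no port at all, yet the single regex accepts it because `.*` is free to run across the path and into the query string. The per-field version rejects it since the port regex only ever sees the port.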
> 
> 
> *Question #2 Explicit vs implicit regexes*
> If we decide to have more than one regex (from #1), do we want all of them
> to be implicitly handled as regex strings? Or do we want to rely on some
> anchoring syntax to flag to the remap engine that the string is a regex
> (such as requiring regexes to start with a '^' and end with a '$').
> 
>    Explicit:
>        pros: clearer that the field is a regex, easier to optimize the remap engine
>        cons: requires more markup, and would mean that if you just put in .* it would be a string match
>    Implicit:
>        pros: more closely matches the markup we have today, simpler configs
>        cons: wasteful if most fields are strings
> 
> *What do I think?*
> In general (and in this specific case) I like explicit over implicit since
> it's clearer what you are doing. Especially if we pick 4 regexes (from
> #1) this would allow you to effectively "disable" regex matching within
> specific fields if you don't need it.
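
A rough Python sketch of the explicit convention, assuming the rule is simply "a field is a regex only if it starts with '^' and ends with '$'". The function name compile_field is hypothetical; a real remap engine would presumably do this once at config load time.

```python
import re

def compile_field(field):
    """Return a predicate for one remap field.

    Assumed convention: a field is treated as a regex only when it starts
    with '^' and ends with '$'. Anything else is compared as a literal
    string, so no regex engine is involved for plain fields.
    """
    if len(field) >= 2 and field.startswith("^") and field.endswith("$"):
        pattern = re.compile(field)
        return lambda value: pattern.match(value) is not None
    return lambda value: value == field

host_regex = compile_field("^.*$")       # explicit regex: matches any host
host_literal = compile_field("foo.com")  # plain string comparison
dot_star = compile_field(".*")           # NOT a regex under this convention

print(host_regex("bar.org"), host_literal("bar.org"), dot_star("bar.org"))
```

The last case is the con listed above: a bare `.*` silently becomes a literal string match and only matches the two-character string ".*".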
> 
> 
> *Question #3 How to handle query strings in the match?*
> Today the query string is not part of remap.config. In regex_remap it is
> optional based on a @pparam=no-query-string. The advantage of not having it
> in the path is you don't have to worry about matching it accidentally or
> reconstructing it. The downside is that you can never match on it. This
> could be controlled by some @ parameter, but that could make remap.config a
> bit confusing since it wouldn't be consistent.
> 
> *What do I think?*
> I would like to leave the query parameters out, or at least have some @
> parameter to disable them on a per-line basis (since I don't want to have
> rules matching on query params).
> 
> 
> *Question #4 How do you *drop* the query string?*
> If for #3 we do decide to put the query string in the match, how would we
> drop it on a specific path? In the regex_remap plugin I'd simply create a
> rule such as:
> 
> # remove query params in regex_remap
> ^/(.*)(\?.*)?  /$1
> 
> And by not adding ?$2 (or something similar) I'd be removing the query
> parameters. Conversely, if I want to keep the query parameters that means I
> effectively have to re-add them on every remap line like so:
> 
> # keep the query params
> ^/(.*)(\?.*)?  /$1/$2
> # or with the shorthand
> ^/(.*)(\?.*)?  /$1/$q
> 
> *What do I think?*
> No real opinion on this one, mostly because I don't really want the query
> params in the matching string to begin with :)
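
One detail that may be worth checking before settling on a markup for this: with a greedy `(.*)` in front of an optional `(\?.*)?`, the first group can swallow the query string. The Python sketch below shows the effect and a more tightly scoped pattern. It assumes the string being matched is the path followed by "?" and the query, and that the engine uses ordinary greedy matching; the rewrite helper is made up for the example.

```python
import re

# Pattern as written in the examples above: (.*) is greedy.
GREEDY = re.compile(r"^/(.*)(\?.*)?")
# Scoped alternative: the first group may not cross the "?".
SCOPED = re.compile(r"^/([^?]*)(\?.*)?$")

def rewrite(pattern, target, keep_query):
    """Rebuild '/<group 1>' and optionally re-append the captured query."""
    m = pattern.match(target)
    if m is None:
        return None
    query = m.group(2) or ""
    return "/" + m.group(1) + (query if keep_query else "")

target = "/a/b?x=1"
print(rewrite(GREEDY, target, keep_query=False))
print(rewrite(SCOPED, target, keep_query=False))
print(rewrite(SCOPED, target, keep_query=True))
```

With the greedy pattern the query string ends up inside the first group, so leaving out the second group does not actually drop it. With the scoped pattern, dropping or keeping the query is just a matter of whether the second group is re-added.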
> 
> 
> *Question #5 How to remap paths that are regexed*
> Today we have remap lines that look something like:
> regex_map http://.*/foo http://dest.com:12345/
> 
> What this means is that any request with path starting with "/foo" will be
> remapped and the "/foo" will be replaced with "/". Once we allow regexes in
> the path we have more variables to take into account. If we take this same
> simple case with regex_remap it would look something like:
> /foo(.*)  http://dest.com:12345/$1
> 
> If we were to mimic this markup in remap.config it could look something
> like:
> regex_map http://.*/foo(.*) http://dest.com:12345/$1
> 
> This is a bit more explicit in what it is doing, but in this simple case it
> means quite a few more characters to get the same meaning across. What this
> gets you is the flexibility to do more complex regex matches if needed. If
> we do go with markup like this and we pick anything more than 1 regex from
> #1 we'd be forced to use named groups instead of $1, $2, etc. since the
> strings matched would come from different regexes.
> 
> 
> *What do I think?*
> If we used 4 regexes (from #1) and explicit (from #2) we could keep the
> markup pretty similar to what we have now and still have the ability to do
> some cool things.
> 
> So, for the base case of regex_map today (regex domain only) it would look
> like:
> regex_map http://^.*$/foo http://dest.com:12345/foo
> 
> This would only regex the domain name (since it starts with ^ and ends with
> $) and then the path would be treated like a regular map (same as it is
> today). If I wanted to do some regexing based on the path I could write
> something like:
> 
> regex_map http://^.*$^/foo(?P<num>\d+)$ http://dest.com:12345/$num/foo
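
To illustrate why named groups are needed once the match is split across several regexes, here is a small Python sketch that merges the named captures from a host regex and a path regex and substitutes them into the destination. The remap function and its signature are invented for the example, and string.Template is just a convenient stand-in for whatever substitution syntax ATS would end up using.

```python
import re
from string import Template

def remap(host, path, host_re, path_re, destination):
    """Match host and path separately, then fill $name from either regex."""
    host_m = re.match(host_re, host)
    path_m = re.match(path_re, path)
    if host_m is None or path_m is None:
        return None
    # Numbered groups would be ambiguous here ($1 of which regex?), so
    # only named groups from both matches are merged.
    captures = {**host_m.groupdict(), **path_m.groupdict()}
    return Template(destination).substitute(captures)

print(remap(
    "foo.com",
    "/foo42",
    r"^.*$",
    r"^/foo(?P<num>\d+)$",
    "http://dest.com:12345/$num/foo",
))
```

A request for /foo42 on any host is rewritten to http://dest.com:12345/42/foo, while a path such as /foobar does not match the path regex and the rule is skipped.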
> 
> 
> 
> 
> If you have any opinions/feedback about these specifics please let me
> know-- we're hoping to nail down the markup fairly quickly and get this
> taken care of in the next few weeks. If you have questions about what
> markup would look like (and don't want to spam the mailing list) feel free
> to mail me individually or PM me on IRC (jacksontj).
> 
> 
> Thomas Jackson
> Traffic SRE @ LinkedIn

Re: Remap/regex_remap consolidation

Posted by James Peach <jp...@apache.org>.

On Mar 25, 2014, at 1:50 PM, Thomas Jackson <ja...@gmail.com> wrote:

> Here at LinkedIn we've been using regular remap.config for a while (with
> all our map options). One thing we've been looking into recently is path
> based regexes (which regex_remap supports). While looking into it we found
> a few shortcomings of the plugin-- and decided it would probably be better
> for everyone involved if we could come up with a way to consolidate this
> into regular remap. This raises a few questions about how to consolidate,
> so we figure we'd solicit some feedback from the community before we get
> started. Since there are quite a few changes we're considering I've tried
> to assemble examples for all the scenarios, but to start out I'll put some
> examples of how remaps look today.

Hi Thomas,

I think there is a real need for something like this. I think that it is pretty common to follow a regex_map with a secondary set of regex_remap rules. One observation that I have is that the current system conflates matching and rewriting. I think that the is why you end up with a large set of alternative regex-based syntaxes below. Have you considered a system of simple, composable match and rewrite operators?

A made-up example, to match https://*.example.com and rewrite it ...

Old version:
	regex_map https://.*.example.com/foo http://dest.example.com

New version:
	map @scheme @value=https @replace=http \
		@host @match=*.example.com @replace=dest.example.com \
		@path @match=/foo(.*) @replace=$0

The new version is verbose, but extensible and more flexible than the fixed syntax. I haven't really thought this through very much ...

> 
> *Standard remap.config example*
> 
> # ExampleA: match all domains coming in on a specific port with a path
> regex_map_with_recv_port http://.*:8080/bar    http://dest.com:12345/bar
> # ExampleB:  match all domains /foo regardless of port
> regex_map http://.*/foo    http://dest.com:12345/foo
> # ExampleC: match everything else on a specific domain
> map http://foo.com/    http://127.0.0.1:12345/catchall
> 
> In regular remap.config some features are missing-- such as regex matching
> based on the path. If i were to use regex_remap on a regular mapping rule
> like so:
> 
> regex_map_with_recv_port http://.*:8080
> http://dest.com/@plugin=/usr/libexec/regex_remap.so
> @pparam=mapfile.map
> 
> I could then have a file (mapfile.map) which would look something like:
> 
> ### map file contents
> # regular remap
> ^/foo(.*)                     http://dest.com:12345/foo/$1
> # strip the query string
> ^/foo(.*)(\?.*)?               http://dest.com:12345/foo/$1?$q
> # a redirect
> /oldpath(.*)                   http://newdomain.com:8080/newpath$1@status=302
> 
> 
> This regex_remap markup gives us a few nice things. This gives you regex
> matching on the path, which we found to be extremely useful. Specifically
> we have a use case for /foo and /foo/ to go to an app, but not /foobar. In
> addition regex_remap map files give you a cleaner markup for redirects
> (@status=xxx), as well as per-remap-line config overrides. As we started
> looking into it we realized that regex_remap (the plugin) is a bit limited
> since you cannot use remap plugins within the map file. We then started
> looking into adding that, but figured it might be less work (and more
> helpful) to merge this into ATS propper. So when merging these features,
> there are a few config questions to be answered.
> 
> 
> *Question #1 how many regexes?*
> Today there is one regex (in all the different regex_* types) which is only
> on the domain name. In regex_remap there is one regex, but it matches the
> path and the query string. So the question is how many different regexes
> should their be?
> 
> To get some background of what these look like I'll implement a rule where
> we match http/https on all domains, ports 8081 and 8082, and all paths.
> 
> The main ones we've thought of so far are:
> 
>    *1 regex*- so the entire string you are matching on is one big regex
>        ^https?://.*:808[12]/(.*)$
> 
>        pros: fairly simple, dense (4 lines of today's configs can be
> merged into 1)
>        cons: easy to mess up (lots to match at once). Fairly difficult to
> tell what its doing
> 
>    *2 regex*- one for domain, and one for path
>        http://.*:8081/(.*)$
>        http://.*:8082/(.*)$
>        https://.*:8081/(.*)$
>        https://.*:8082/(.*)$
> 
>        pros: Closer matches what we do today
>        cons: more verbose (can't regex the scheme or port)
> 
>    *4 regex*- one for schema, domain, port, and path
>        ^https?$://^.*$:^808[12]$/^(.*)$
> 
>        pros: seperation of the various regexes, dense, impossible to
> capture more than the field you are in
>        cons: 4 regexes instead of 1 (might be more confusing?)
> 
> *Note*: any number of regexes >1 will require named capture groups (
> http://www.regular-expressions.info/named.html). Which means that you
> cannot use $1, $2, etc. In a lot of ways this is nicer (more explicit) but
> it is a change.
> 
> *What do I think?*
> I prefer #2 or #4 as they help seperate the regex matching into smaller
> regexes (which should make finding non-matching regexes faster) as well as
> make the regexes more scoped-- and hopefully harder to mess up (especially
> by matching too much).
> 
> 
> *Question #2 Explicit vs implicit regexes*
> If we decide to have more than one regex (from #1), do we want all of them
> to be implicitly handled as regex strings? Or do we want to rely on some
> anchoring syntax to flag to the remap engine that the string is a regex
> (such as requiring regexes to start with a '^' and end with a '$').
> 
>    Explicit:
>        pros: clearer that the field is a regex, easier to optimize the
> remap engine
>        cons: requires more markup, and would mean that if you just put in
> .* it would be a string match
>    Implicit:
>        pros: closer matches the markup we have today, simpler configs
>        cons: wasteful if most fields are strings
> 
> *What do I think?*
> In general (and in this specific case) I like explicit over implicit since
> its more clear what you are doing. Especially if we pick 4 regexes (from
> #1) this would allow you to effectively "disable" regex matching within
> specific fields if you don't need it.
> 
> 
> *Question #3 How to handle query strings in the match?*
> Today the query string is not part of remap.config. In regex_remap it is
> optional based on a @pparam=no-query-string. The advantage of not having it
> in the path is you don't have to worry about matching it accidentally or
> reconstructing it. The downside is that you can never match on it. This
> could be controlled by some @ parameter, but that could make remap.config a
> bit confusing since it wouldn't be consistent.
> 
> *What do I think?*
> I would like to leave the query parameters out, or at least have some @
> parameter to disable them on a per-line basis (since I don't want to have
> rules matching on query params).
> 
> 
> *Question #4 How do you *drop* the query string?*
> If for #3 we do decide to put the query string in the match, how would we
> drop it on a specific path? In the regex_remap plugin I'd simply create a
> rule such as:
> 
> # remove query params in regex_remap
> ^/(.*)(\?.*)?  /$1
> 
> And by not adding ?$2 (or something similar) I'd be removing the query
> parameters. Conversely, if I want to keep the query parameters that means i
> effectively have to re-add them every remap line like so:
> 
> # keep the query params
> ^/(.*)(\?.*)?  /$1/$2
> # or with the shorthand
> ^/(.*)(\?.*)?  /$1/$q
> 
> *What do I think?*
> No real opinion on this one, mostly because I don't really want the query
> params in the matching string to begin with :)
> 
> 
> *#5 How to remap paths that are regexed*
> Today we have remap lines that look something like:
> regex_map http://.*/foo http://dest.com:12345/
> 
> What this means is that any request with path starting with "/foo" will be
> remapped and the "/foo" will be replaced with "/". Once we allow regexes in
> the path we have more variables to take into account. If we take this same
> simple case with regex_remap it would look something like:
> /foo(.*)  http://dest.com:12345/$1
> 
> If we were to mimic this markup in remap.config it could look something
> like:
> regex_map http://.*/foo(.*) http://dest.com:12345/$1
> 
> This is a bit more explicit in what it is doing, but in this simple case it
> means quite a few more characters to get the same meaning across. What this
> gets you is the flexibility to do more complex regex matches if needed. If
> we do go with markup like this and we pick anything more than 1 regex from
> #1 we'd be forced to use named groups instead of $1, $2, etc. since the
> strings matched would come from different regexes.
> 
> 
> *What do I think?*
> If we used 4 regexes (from #1) and explicit (from #2) we could keep the
> markup pretty similar to what we have now and still have the ability to do
> some cool things.
> 
> So, for the base case of regex_map today (regex domain only) it would look
> like:
> regex_map http://^.*$/foo http://dest.com:12345/foo
> 
> This would only regex the domain name (since it starts with ^ and ends with
> $) and then the path would be treated like a regular map (same as it is
> today). If I wanted to do some regexing based on the path I could write
> something like:
> 
> regex_map http://^.*$^/foo(?P<num>\d+)$ http://dest.com:12345/$num/foo
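
A minimal sketch of how that last rule might be evaluated under the explicit,
one-regex-per-field idea. It is written in Python purely for illustration;
`match_field` and `remap` are hypothetical names, only the host and path
fields are handled, and the fields are assumed to be split out already. A
field wrapped in ^...$ is treated as a regex, anything else as a plain string,
and named groups from every field are merged before being substituted into the
destination.

```python
import re
from string import Template

def match_field(spec, value, prefix=False):
    """Return a dict of named groups if `value` matches `spec`, else None.

    Assumption: a spec wrapped in ^...$ is a regex; anything else is a literal.
    """
    if spec.startswith("^") and spec.endswith("$"):
        m = re.match(spec, value)
        return m.groupdict() if m else None
    ok = value.startswith(spec) if prefix else value == spec
    return {} if ok else None

def remap(host_spec, path_spec, dest, host, path):
    """Match host and path independently, then fill $name placeholders in dest.

    Simplification: for a literal path the rest of the path is not appended.
    """
    host_groups = match_field(host_spec, host)
    path_groups = match_field(path_spec, path, prefix=True)
    if host_groups is None or path_groups is None:
        return None
    # Named groups keep captures from separate regexes from colliding,
    # which positional $1/$2 could not guarantee.
    return Template(dest).substitute({**host_groups, **path_groups})

print(remap("^.*$", r"^/foo(?P<num>\d+)$", "http://dest.com:12345/$num/foo",
            "example.com", "/foo42"))
# http://dest.com:12345/42/foo
```

The literal branch is what would let you "disable" regex matching per field:
no regex is compiled or run unless the field opts in with ^ and $.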
> 
> 
> 
> 
> If you have any opinions/feedback about these specifics please let me
> know-- we're hoping to nail down the markup fairly quickly and get this
> taken care of in the next few weeks. If you have questions about what
> markup would look like (and don't want to spam the mailing list) feel free
> to mail me individually or PM me on IRC (jacksontj).
> 
> 
> Thomas Jackson
> Traffic SRE @ LinkedIn