PHP preg_match_all subpattern names in a pattern

Question:

The task is simple to state. The input is an arbitrary regex pattern that may contain named subpatterns, and the output should be an array of the subpattern names:

function get_subpattern_names($any_input_pattern) {
  // What pattern to use here?
  $pattern_to_get_names = '/.../';

  preg_match_all($pattern_to_get_names, $any_input_pattern, $matches);

  return $matches;
}

So the question is what to use as $pattern_to_get_names in the function above?

For example:

get_subpattern_names('/(?P<name>\w+): (?P<digit>\d+)/');

should return:

array('name', 'digit');

P.S.: According to PCRE documentation subpattern names consist of up to 32 alphanumeric characters and underscores.

As we don’t control the input pattern, we need to take into account all possible syntaxes of naming. According to PHP documentation they are:

(?P<name>pattern), (?<name>pattern) and (?'name'pattern).

We also need to take into account nested subpatterns, for example:

(?<name1>.*(?<name2>pattern).*).

There’s no need to count duplicate names, to preserve the order of appearance, or to collect numbered, non-capturing or other kinds of subpatterns. Just a list of names, if any are present.

Answers:

Answer 1:

You may get a list of all valid named capture group names using

"~(?<!\\\\)(?:\\\\{2})*\(\?(?|P?<([_A-Za-z]\w{0,31})>|'([_A-Za-z]\w{0,31})')~"

See the regex and an online PHP demo.

The idea is to match an unescaped ( followed by ?, and then either P< or < plus the group name and a closing >, or ' plus the group name and a closing '.

$rx = "~(?<!\\\\)(?:\\\\{2})*\(\?(?|P?<([_A-Za-z]\w{0,31})>|'([_A-Za-z]\w{0,31})')~";
$s = "(?P<name>\w+): (?<name2>\w+): (?'digit'\d+)";
preg_match_all($rx, $s, $res);
print_r($res[1]);

yields

Array
(
    [0] => name
    [1] => name2
    [2] => digit
)
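
To see what the escape handling does, the same pattern can be run against input in which one opening parenthesis is escaped and another is preceded by an escaped backslash. This is only a sketch: the sample strings and the group names skipped and kept are made up for illustration.

```php
<?php
$rx = "~(?<!\\\\)(?:\\\\{2})*\(\?(?|P?<([_A-Za-z]\w{0,31})>|'([_A-Za-z]\w{0,31})')~";

$bs = '\\'; // one literal backslash

// \(?<skipped>x)  : the ( is escaped, so this is not a real group.
$escaped = $bs . '(?<skipped>x)';

// \\(?<kept>x)    : an escaped backslash followed by a real named group.
$unescaped = $bs . $bs . '(?<kept>x)';

preg_match_all($rx, $escaped . ' ' . $unescaped, $res);

// Only "kept" is listed; "skipped" sits behind an escaped parenthesis.
print_r($res[1]);
```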

Pattern details

  • (?<!\\) – no \ immediately to the left of the current location
  • (?:\\{2})* – zero or more pairs of backslashes (so that a ( preceded only by escaped backslashes still counts as unescaped)
  • \( – a (
  • \? – a ?
  • (?|P?<([_A-Za-z]\w{0,31})>|'([_A-Za-z]\w{0,31})') – a branch reset group:
    • P?<([_A-Za-z]\w{0,31})> – an optional P, <, a _ or an ASCII letter, 0 to 31 word chars (digits/letters/_) (captured into Group 1), and >
    • | – or
    • '([_A-Za-z]\w{0,31})' – a ', a _ or an ASCII letter, 0 to 31 word chars (digits/letters/_) (also captured into Group 1), and then '

The group names are all captured into Group 1, so you only need to read $res[1].
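
Plugged into the function from the question, the pattern could be used roughly as follows. This is a minimal sketch rather than part of the original answer: the type declarations and the array_unique/array_values step (which drops repeated names and reindexes the result) are assumptions added for illustration.

```php
<?php
// Sketch: the question's function, filled in with the pattern from this answer.
function get_subpattern_names(string $any_input_pattern): array
{
    $pattern_to_get_names =
        "~(?<!\\\\)(?:\\\\{2})*\(\?(?|P?<([_A-Za-z]\w{0,31})>|'([_A-Za-z]\w{0,31})')~";

    preg_match_all($pattern_to_get_names, $any_input_pattern, $matches);

    // Assumption: repeated names are collapsed; the question does not require this.
    return array_values(array_unique($matches[1]));
}

// The example from the question, plus the nested case.
print_r(get_subpattern_names('/(?P<name>\w+): (?P<digit>\d+)/'));
print_r(get_subpattern_names('(?<name1>.*(?<name2>pattern).*)'));
```

Both branches of the branch reset group write to Group 1, so $matches[1] is the only element that needs to be read regardless of which naming syntax the input uses.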

Answer 2:

Wiktor’s solution does seem quite thorough, but here’s what I came up with.

print_r(get_subpattern_names('/(?P<name>\w+): (?P<digit>\d+)/'));

function get_subpattern_names($input_pattern){
    preg_match_all('/\?P\<(.+?)\>/i', $input_pattern, $matches);
    return $matches[1];
}

This should work for patterns that use the (?P<name>...) syntax; it does not pick up the (?<name>...) or (?'name'...) forms. More importantly, it is much more readable and self-explanatory.

Basically, I search for ?P< followed by (.+?), a non-greedy match of whatever sits between the angle brackets. The function then returns $matches[1], which holds the text captured by the first set of parentheses.
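
A quick way to check what this simpler pattern covers is to run it on an input that uses all three naming syntaxes listed in the question. The sketch below assumes a made-up sample pattern and a hypothetical function name, get_subpattern_names_simple, chosen so that it does not clash with the question's function.

```php
<?php
// Hypothetical name; the body is the function from this answer.
function get_subpattern_names_simple(string $input_pattern): array
{
    preg_match_all('/\?P\<(.+?)\>/i', $input_pattern, $matches);
    return $matches[1];
}

// One group per naming syntax: (?P<a>...), (?<b>...), (?'c'...).
$names = get_subpattern_names_simple("/(?P<a>\w+) (?<b>\w+) (?'c'\w+)/");

// Only the (?P<...>) form is found, so this lists just "a".
print_r($names);
```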

Original source:

https://stackoverflow.com/questions/47753306/php-preg-match-all-subpattern-names-in-a-pattern
