preg_match_all() - PHP Regular Expressions (PCRE)

乐乐 · 2023-11-21 · #Technical tips

preg_match_all()

(PHP 4, PHP 5, PHP 7)

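Perform a global regular expression match

Description

preg_match_all(string $pattern, string $subject[, array &$matches[, int $flags = PREG_PATTERN_ORDER[, int $offset = 0]]]): int

Searches $subject for all matches to the regular expression given in $pattern and puts them in $matches in the order specified by $flags.

After the first match is found, subsequent searches continue from the end of the last match.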

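Parameters

$pattern

The pattern to search for, as a string.

$subject

The input string.

$matches

A multidimensional array that receives all matches as an output parameter, ordered according to $flags.

$flags

Can be a combination of the following flags (note that PREG_PATTERN_ORDER and PREG_SET_ORDER cannot be used together):

PREG_PATTERN_ORDER

Orders results so that $matches[0] is an array of full pattern matches, $matches[1] is an array of strings matched by the first parenthesized subgroup, and so on.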
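
The example code for this flag did not survive on this page. A minimal sketch that is consistent with the output shown next, assuming a sample string of two adjacent HTML elements and an ungreedy tag pattern (both chosen here for illustration):

```php
<?php
// Assumed sample input: two adjacent HTML elements.
$html = '<b>example: </b><div align=left>this is a test</div>';

// The U modifier makes the quantifiers ungreedy, so each element is matched on its own.
preg_match_all('|<[^>]+>(.*)</[^>]+>|U', $html, $out, PREG_PATTERN_ORDER);

echo $out[0][0] . ", " . $out[0][1] . "\n"; // the two full matches
echo $out[1][0] . ", " . $out[1][1] . "\n"; // the first subgroup of each match
```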

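The above example will output:

<b>example: </b>, <div align=left>this is a test</div>
example: , this is a test

So, $out[0] contains an array of strings that matched the full pattern, and $out[1] contains an array of strings enclosed by tags.

If the pattern contains named subgroups, $matches additionally contains entries for keys with the subgroup name.

If the pattern contains duplicate named subgroups, only the rightmost subgroup is stored in $matches[NAME].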
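
A short sketch of the duplicate-name case, assuming the (?J) option is used to allow two subgroups to share the name "match" (the subject string and group name are illustrative):

```php
<?php
// (?J) allows duplicate subgroup names inside the pattern.
preg_match_all(
    '/(?J)(?<match>foo)|(?<match>bar)/',
    'foo bar',
    $matches,
    PREG_PATTERN_ORDER
);

// Only the rightmost subgroup named "match" is stored under the name key,
// so the entry for the "foo" match is empty.
print_r($matches['match']);
```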

The above example will output:

Array
(
    [0] => 
    [1] => bar
)

PREG_SET_ORDER

Orders results so that $matches[0] is an array of the first set of matches (the full match and its subgroups), $matches[1] is an array of the second set of matches, and so on.
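
A sketch of the same kind of search with PREG_SET_ORDER, again assuming an illustrative two-element HTML string; each entry of $out now holds one complete match together with its subgroup:

```php
<?php
// Assumed sample input: two adjacent HTML elements.
$html = '<b>example: </b><div align="left">this is a test</div>';

preg_match_all('|<[^>]+>(.*)</[^>]+>|U', $html, $out, PREG_SET_ORDER);

echo $out[0][0] . ", " . $out[0][1] . "\n"; // first match: full text, then subgroup
echo $out[1][0] . ", " . $out[1][1] . "\n"; // second match: full text, then subgroup
```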

The above example will output:

<b>example: </b>, example: 
<div align="left">this is a test</div>, this is a test

PREG_OFFSET_CAPTURE

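If this flag is passed, the string offset of each match is returned along with it. Note that this changes every matched string in $matches into an array whose element 0 is the matched string and whose element 1 is its offset (in bytes) into $subject.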
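
A minimal sketch of PREG_OFFSET_CAPTURE, assuming a three-group pattern applied to the string "foobarbaz":

```php
<?php
preg_match_all('/(foo)(bar)(baz)/', 'foobarbaz', $matches, PREG_OFFSET_CAPTURE);

// Every match is now a [string, byte offset] pair instead of a plain string.
print_r($matches);
```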

The above example will output:

Array
(
    [0] => Array
        (
            [0] => Array
                (
                    [0] => foobarbaz
                    [1] => 0
                )
        )
    [1] => Array
        (
            [0] => Array
                (
                    [0] => foo
                    [1] => 0
                )
        )
    [2] => Array
        (
            [0] => Array
                (
                    [0] => bar
                    [1] => 3
                )
        )
    [3] => Array
        (
            [0] => Array
                (
                    [0] => baz
                    [1] => 6
                )
        )
)

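If no order flag is given, PREG_PATTERN_ORDER is assumed.

$offset

Normally, the search starts from the beginning of the subject string. The optional parameter $offset can be used to specify an alternate place from which to start the search (in bytes).

Note:

Using $offset is not equivalent to passing substr($subject, $offset) to preg_match_all() in place of the subject string, because $pattern can contain assertions such as ^, $ or (?<=x).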
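
To make the difference concrete, here is a small sketch; the subject "xyzabc" and the pattern /^abc/ are assumptions chosen for illustration:

```php
<?php
$subject = 'xyzabc';

// With $offset, ^ still refers to the real start of $subject, so nothing matches.
$withOffset = preg_match_all('/^abc/', $subject, $m1, PREG_PATTERN_ORDER, 3);

// With substr(), the shortened string itself starts with "abc", so ^ matches.
$withSubstr = preg_match_all('/^abc/', substr($subject, 3), $m2);

var_dump($withOffset, $withSubstr);
```

The first call reports no matches and the second reports one, even though both start looking at the same byte of the original string.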

Find matching HTML tags (greedy)
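
The code for this example is missing from this page. A sketch consistent with the output that follows, assuming a sample string holding a bold element and a link, and using the backreference \2 to require that the closing tag name equals the opening one:

```php
<?php
// Assumed sample input.
$html = '<b>bold text</b><a href=howdy.html>click me</a>';

// Group 1: opening tag, group 2: tag name, group 3: content, group 4: closing tag.
// \2 refers back to whatever group 2 matched.
preg_match_all('~(<(\w+)[^>]*>)(.*?)(</\2>)~', $html, $matches, PREG_SET_ORDER);

foreach ($matches as $val) {
    echo "matched: " . $val[0] . "\n";
    echo "part 1: " . $val[1] . "\n";
    echo "part 2: " . $val[2] . "\n";
    echo "part 3: " . $val[3] . "\n";
    echo "part 4: " . $val[4] . "\n\n";
}
```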

The above example will output:

matched: <b>bold text</b>
part 1: <b>
part 2: b
part 3: bold text
part 4: </b>

matched: <a href=howdy.html>click me</a>
part 1: <a href=howdy.html>
part 2: a
part 3: click me
part 4: </a>

Using named subpatterns
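
A sketch of named subpatterns, assuming a three-line "key: value" input and the group names "name" and "digit":

```php
<?php
// Assumed sample input: one "name: digit" pair per line.
$str = <<<FOO
a: 1
b: 2
c: 3
FOO;

// Each named group is stored twice: under its name and under its numeric index.
preg_match_all('/(?P<name>\w+): (?P<digit>\d+)/', $str, $matches);

print_r($matches);
```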

The above example will output:

Array
(
    [0] => Array
        (
            [0] => a: 1
            [1] => b: 2
            [2] => c: 3
        )
    [name] => Array
        (
            [0] => a
            [1] => b
            [2] => c
        )
    [1] => Array
        (
            [0] => a
            [1] => b
            [2] => c
        )
    [digit] => Array
        (
            [0] => 1
            [1] => 2
            [2] => 3
        )
    [2] => Array
        (
            [0] => 1
            [1] => 2
            [2] => 3
        )
)

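See also

PCRE Patterns
preg_quote() - Quote regular expression characters
preg_match() - Perform a regular expression match
preg_replace() - Perform a regular expression search and replace
preg_split() - Split a string by a regular expression
preg_last_error() - Return the error code of the last PCRE regex execution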

If you want to extract all {token}s from a string:
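
The note's code is not shown here; one way to do it, assuming tokens never contain a closing brace and using an illustrative subject string:

```php
<?php
// Assumed sample input with two brace-delimited tokens.
$subject = '{token1} foo {token2} bar';

// A literal "{", then anything that is not "}", then a literal "}".
preg_match_all('/\{[^}]*\}/', $subject, $matches);

print_r($matches);
```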

output:
Array
(
  [0] => Array
    (
      [0] => {token1}
      [1] => {token2}
    )
)

The code that john at mccarthy dot net posted is not necessary. If you want your results grouped by individual match, simply pass the PREG_SET_ORDER flag as the fourth argument to preg_match_all().

Be careful with this pattern match and large input buffer on preg_match_* functions.

if $buffer is 80+ KB in size, you'll end up with segfault! 
[89396.588854] php[4384]: segfault at 7ffd6e2bdeb0 ip 00007fa20c8d67ed sp 00007ffd6e2bde70 error 6 in libpcre.so.3.13.1[7fa20c8c3000+3c000]
This is due to PCRE recursion. It has been a known bug in PHP since 2008, but its source is not PHP itself but the PCRE library.
Rasmus Lerdorf has the answer: https://bugs.php.net/bug.php?id=45735#1365812629
"The problem here is that there is no way to detect run-away regular expressions 
here without huge performance and memory penalties. Yes, we could build PCRE in a 
way that it wouldn't segfault and we could crank up the default backtrack limit 
to something huge, but it would slow every regex call down by a lot. If PCRE 
provided a way to handle this in a more graceful manner without the performance 
hit we would of course use it."

I needed a function to rotate the results of a preg_match_all query, and made this. Not sure if it exists.

Example - Take results of some preg_match_all query:
Array
(
  [0] => Array
    (
      [1] => Banff 
      [2] => Canmore
      [3] => Invermere
    )
 
  [1] => Array
    (
      [1] => AB 
      [2] => AB
      [3] => BC
    )
 
  [2] => Array
    (
      [1] => 51.1746254 
      [2] => 51.0938416
      [3] => 50.5065193
    )
 
  [3] => Array
    (
      [1] => -115.5719757 
      [2] => -115.3517761
      [3] => -116.0321884
    )
 
  [4] => Array
    (
      [1] => T1L 1B3 
      [2] => T1W 1N2
      [3] => V0B 2G0
    )
)
Rotate it 90 degrees to group results as records:
Array
(
  [0] => Array
    (
      [1] => Banff 
      [2] => AB
      [3] => 51.1746254
      [4] => -115.5719757
      [5] => T1L 1B3
    )
 
  [1] => Array
    (
      [1] => Canmore
      [2] => AB
      [3] => 51.0938416
      [4] => -115.3517761
      [5] => T1W 1N2
    )
 
  [2] => Array
    (
      [1] => Invermere
      [2] => BC
      [3] => 50.5065193
      [4] => -116.0321884
      [5] => V0B 2G0
    )
)
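
The function itself is missing from this page. A minimal sketch of such a rotation, using a hypothetical helper named rotate_matches() and an illustrative pattern; it produces the same grouping that the built-in PREG_SET_ORDER flag gives directly:

```php
<?php
/**
 * Hypothetical helper: transpose a PREG_PATTERN_ORDER result so that each
 * row holds all groups belonging to one match.
 */
function rotate_matches(array $matches): array
{
    $rotated = [];
    foreach ($matches as $group => $values) {
        foreach ($values as $i => $value) {
            $rotated[$i][$group] = $value;
        }
    }
    return $rotated;
}

// Assumed sample data: "town,province" pairs separated by spaces.
$input = 'Banff,AB Canmore,AB Invermere,BC';

preg_match_all('/(\w+),(\w+)/', $input, $byGroup);
$records = rotate_matches($byGroup);

// For comparison: the same grouping straight from preg_match_all().
preg_match_all('/(\w+),(\w+)/', $input, $bySet, PREG_SET_ORDER);

print_r($records[2]);
```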

Here is an awesome online regex editor, https://regex101.com/,
which helps you test your regular expressions (PCRE, JS, Python) with real-time highlighting of regex matches on your input data.

Here's some fleecy code to 1. validate RFC 2822 conformity of address lists and 2. extract the address specification (the part commonly known as 'email'). I wouldn't suggest using it for input form email checking, but it might be just what you want for other email applications. I know it can be optimized further, but that part I'll leave up to you nutcrackers. The total length of the resulting regex is about 30000 bytes. That is because it accepts comments. You can remove that by setting $cfws to $fws, and it shrinks to about 6000 bytes. Conformity checking follows RFC 2822 strictly. Have fun and email me if you have any enhancements!

For parsing queries with entities use:

Perhaps you want to find the positions of all anchor tags. This returns a two-dimensional array holding the starting and ending position of each tag.
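
The note's code is not included on this page. A rough sketch of the idea, assuming a hypothetical function anchor_positions() and a simple (not fully HTML-aware) pattern; positions are byte offsets, and the end offset is exclusive:

```php
<?php
/**
 * Hypothetical helper: return [start, end] byte offsets for every
 * <a ...>...</a> element found in $html.
 */
function anchor_positions(string $html): array
{
    // i: case-insensitive, s: let "." also match newlines inside the link text.
    preg_match_all('~<a\b[^>]*>.*?</a>~is', $html, $m, PREG_OFFSET_CAPTURE);

    $positions = [];
    foreach ($m[0] as [$text, $start]) {
        $positions[] = [$start, $start + strlen($text)];
    }
    return $positions;
}

print_r(anchor_positions('x<a href="#">y</a>z<a>w</a>'));
```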

To count the length of a UTF-8 string I use
$count = preg_match_all("/[[:print:]\pL]/u", $str, $pockets);
where
[:print:] - printing characters, including space
\pL - any Unicode letter
/u - treat pattern and subject as UTF-8
Other Unicode character properties are listed at http://www.pcre.org/pcre.txt
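
A quick sketch of this counting trick next to strlen(), assuming a sample string with two accented letters (written with \u{...} escapes so the snippet does not depend on the file's encoding):

```php
<?php
// "héllo wörld": 11 characters, but 13 bytes in UTF-8.
$str = "h\u{e9}llo w\u{f6}rld";

// Each printable character or letter is one match, so the return value is the count.
$count = preg_match_all('/[[:print:]\pL]/u', $str, $pockets);

echo $count . " characters, " . strlen($str) . " bytes\n";
```

Note that characters outside the class, such as control characters, are not counted.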

Here is a way to match everything on the page, performing an action for each match as you go. I had used this idiom in other languages, where its use is customary, but in PHP it seems to be not quite as common.

Note that the offsets returned are byte values (not necessarily number of characters) so you'll have to make sure the data is single-byte encoded. (Or have a look at paolo mosna's strByte function on the strlen manual page).
I'd be interested to know how this method performs speedwise against using preg_match_all and then recursing through the results.
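
The idiom described above can be sketched as follows; the function name each_match() and its callback signature are hypothetical, and the loop uses preg_match() with a moving byte offset rather than preg_match_all():

```php
<?php
/**
 * Hypothetical helper: call $action(matchedText, byteOffset) for every match
 * of $pattern in $subject, scanning left to right. Returns the match count.
 */
function each_match(string $pattern, string $subject, callable $action): int
{
    $offset = 0;
    $count = 0;
    while (preg_match($pattern, $subject, $m, PREG_OFFSET_CAPTURE, $offset) === 1) {
        [$text, $pos] = $m[0];
        $action($text, $pos);
        $count++;

        // Continue after this match; move at least one byte so an empty
        // match cannot loop forever.
        $offset = $pos + max(strlen($text), 1);
        if ($offset > strlen($subject)) {
            break;
        }
    }
    return $count;
}

each_match('/\d+/', 'a1 b22 c333', function (string $text, int $pos): void {
    echo "found {$text} at byte {$pos}\n";
});
```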

I have made up a simple function to extract a phone number from a string.
I am not sure how good it is, but it works.
It keeps only the digits 0-9 and the "-", " ", "(", ")" and "." characters. As far as I know, these are the most widely used characters in a phone number.
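
The function is not shown on this page. A minimal sketch of the approach, with a hypothetical name extract_phone_chars(); it simply keeps the listed characters and trims the result, without validating that the result is a real phone number:

```php
<?php
/**
 * Hypothetical helper: keep only digits, "-", " ", "(", ")" and "."
 * from $input, then trim surrounding spaces.
 */
function extract_phone_chars(string $input): string
{
    // Every allowed character is one match; joining them drops everything else.
    preg_match_all('/[0-9\-(). ]/', $input, $m);
    return trim(implode('', $m[0]));
}

echo extract_phone_chars('Call: (555) 123-4567 now!') . "\n";
```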

please note, that the function of "mail at SPAMBUSTER at milianw dot de" can result in invalid xhtml in some cases. think i used it in the right way but my result is sth like this:
foo foo foo foo 
correct me if i'm wrong. 
i'll see when there's time to fix that. -.-

If you'd like to include DOUBLE QUOTES on a regular expression for use with preg_match_all, try ESCAPING THRICE, as in: \\\"
For example, the pattern:
'/

[\s\w\/=\\\"]*/' Should be able to match:

a b

.. with all there is under those table tags. I'm not really sure why this is so, but I tried just the double quote and one or even two escape characters and it won't work. In my frustration I added another one and then it's cool.

When a pattern can match both a longer and a shorter version of a string at the same position, only one of them is captured: each position in the subject yields at most one entry in $matches[0]. If ? is used, the quantifier is greedy and the longer version is captured; if | is used, the first alternative that matches wins. You might expect ['ab', 'abc'] in $m[0] in both cases, but that is not so. The actual results are [['ab']] and [['abc']]:
array(1) { [0]=> array(1) { [0]=> string(2) "ab" } } array(1) { [0]=> array(1) { [0]=> string(3) "abc" } }
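
A small sketch of this behaviour, assuming the subject "abc" and the two patterns /ab|abc/ and /abc?/ as illustrations:

```php
<?php
// Alternation: the first alternative that matches at a position wins.
preg_match_all('/ab|abc/', 'abc', $viaAlternation);

// Optional character: ? is greedy, so the longer version is taken.
preg_match_all('/abc?/', 'abc', $viaOptional);

var_dump($viaAlternation);
var_dump($viaOptional);
```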

I had been crafting and testing some regexp patterns online using the tools Regex101 and a `preg_match_all()` tester and found that the regexp patterns I wrote worked fine on them, just not in my code. My problem was not double-escaping backslash characters:
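
The note's code is missing here. As a sketch of the double-escaping issue, assume the goal is to match a literal backslash: the regex needs \\, and each of those backslashes must itself be escaped in the PHP string literal:

```php
<?php
// Single-quoted, so \t and \f below are a plain backslash followed by a letter.
$subject = 'C:\temp\file.txt';

// PHP string '/\\\\/' is the regex /\\/, which matches one literal backslash.
$pattern = '/\\\\/';

$count = preg_match_all($pattern, $subject, $m);

echo $pattern . " matched " . $count . " backslashes\n";
```

Online testers take the regex text directly, so a pattern copied from them usually needs its backslashes doubled once more when pasted into a PHP string.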
