Regular Expression Options

Posted on 2015-07-07, edited on 2023-11-24, in Regular Expression

There are five inline options:

Option	Description	RegexOptions
i	Use case-insensitive matching.	IgnoreCase
m	Use multiline mode. ^ and $ match the beginning and end of a line, instead of the beginning and end of a string.	Multiline
n	Do not capture unnamed groups.	ExplicitCapture
s	Use single-line mode.	Singleline
x	Ignore unescaped white space in the regular expression pattern.	IgnorePatternWhitespace
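As a small sketch of how an inline option relates to its RegexOptions counterpart, the following compares (?i) written inside the pattern with RegexOptions.IgnoreCase passed as an argument. The class name InlineOptionDemo and the sample strings are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

namespace ConsoleApp1
{
    static class InlineOptionDemo
    {
        // Inline form: (?i) turns on case-insensitive matching for the rest of the pattern.
        public static bool MatchInline(string input)
        {
            return Regex.IsMatch(input, @"(?i)hello");
        }

        // Equivalent form: the same option passed as a RegexOptions value.
        public static bool MatchWithOption(string input)
        {
            return Regex.IsMatch(input, @"hello", RegexOptions.IgnoreCase);
        }

        // No option at all: matching is case-sensitive.
        public static bool MatchDefault(string input)
        {
            return Regex.IsMatch(input, @"hello");
        }

        static void Main(string[] args)
        {
            Console.WriteLine(MatchInline("HELLO world"));      // True
            Console.WriteLine(MatchWithOption("HELLO world"));  // True
            Console.WriteLine(MatchDefault("HELLO world"));     // False
        }
    }
}
```

The inline form is handy when the pattern comes from configuration and the calling code cannot be changed to pass a RegexOptions value.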

RegexOptions is an enumeration; the ten members below are covered here.

I. None

This is simply the regular expression engine's default behavior, which is:

not ECMAScript, LeftToRight, not IgnoreCase, not IgnorePatternWhitespace, not CultureInvariant, not ExplicitCapture,

and "." matches any character except the newline character (\n).

II. IgnoreCase

By default the regular expression engine is case-sensitive.

Setting RegexOptions.IgnoreCase makes matching case-insensitive.
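A minimal sketch of the difference, in the same style as the other examples in this post. The class name IgnoreCaseDemo and the sample input are assumptions chosen for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

namespace ConsoleApp1
{
    static class IgnoreCaseDemo
    {
        // Replaces every occurrence of "cat" using the given options.
        public static string ReplaceCat(string input, RegexOptions options)
        {
            return Regex.Replace(input, "cat", "dog", options);
        }

        static void Main(string[] args)
        {
            string input = "Cat cat CAT";

            // Default (case-sensitive): only the lowercase "cat" is replaced.
            Console.WriteLine(ReplaceCat(input, RegexOptions.None));       // Cat dog CAT

            // IgnoreCase: all three spellings are replaced.
            Console.WriteLine(ReplaceCat(input, RegexOptions.IgnoreCase)); // dog dog dog
        }
    }
}
```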

III. Multiline

Without Multiline:

using System;
using System.Text.RegularExpressions;

namespace ConsoleApp1
{
    class Program
    {
        static void Main(string[] args)
        {
            string pattern = @"^1{1}";
            string str = "123\n123\u000a123";
            string result = Regex.Replace(str, pattern, "*");
            Console.WriteLine(result);
        }
    }
}

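The result is three lines, *23, 123, and 123: only the 1 at the very start of the string is replaced. (\u000a is the same character as \n.)

With Multiline: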

using System;
using System.Text.RegularExpressions;

namespace ConsoleApp1
{
    class Program
    {
        static void Main(string[] args)
        {
            RegexOptions regexOptions = RegexOptions.Multiline;
            string pattern = @"^1{1}";
            string str = "123\n123\u000a123";
            string result = Regex.Replace(str, pattern, "*", regexOptions);
            Console.WriteLine(result);
        }
    }
}

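The result is three lines, each reading *23: the 1 at the start of every line is replaced.

Notes:

1. With RegexOptions.Multiline set, "^" and "$" no longer match only the beginning and end of the whole string; they match the beginning and end of each line.

2. RegexOptions.Multiline and RegexOptions.Singleline are not mutually exclusive.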
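To illustrate note 2, here is a small sketch that sets both options at once. Multiline affects only "^" and "$", while Singleline affects only ".", so the two can be combined freely. The class name CombinedOptionsDemo, the pattern, and the input are assumptions made for this example.

```csharp
using System;
using System.Text.RegularExpressions;

namespace ConsoleApp1
{
    static class CombinedOptionsDemo
    {
        // Returns the matched text, or null when the pattern does not match.
        public static string FirstMatch(string input, RegexOptions options)
        {
            Match m = Regex.Match(input, @"^b.*", options);
            return m.Success ? m.Value : null;
        }

        static void Main(string[] args)
        {
            string input = "a1\nb2\nc3";

            // Multiline only: ^ matches at the start of line 2, "." stops at the line feed.
            Console.WriteLine(FirstMatch(input, RegexOptions.Multiline));   // b2

            // Singleline only: ^ matches only at the start of the string, so there is no match.
            Console.WriteLine(FirstMatch(input, RegexOptions.Singleline) == null);  // True

            // Both: ^ matches at line 2 and "." also consumes the line feed.
            string both = FirstMatch(input, RegexOptions.Multiline | RegexOptions.Singleline);
            Console.WriteLine(both.Replace("\n", "\\n"));                   // b2\nc3
        }
    }
}
```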

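IV. ExplicitCapture

When RegexOptions.ExplicitCapture is set, the regular expression engine captures only named groups;

unnamed groups are not captured. The two programs below show the difference, first without the option.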

using System;
using System.Text.RegularExpressions;

namespace ConsoleApp1
{
    class Program
    {
        static void Main(string[] args)
        {
            string input = "abc-123,def-456,";
            string pattern = @"([a-z]{3})-(?<MyName>\d{3}),";

            MatchCollection matchCollection = Regex.Matches(input, pattern);

            foreach (Match match in matchCollection)
            {
                for (int groupsCount = 1; groupsCount < match.Groups.Count; groupsCount++)
                {
                    Console.WriteLine("   Group {0}:  {1}", groupsCount, match.Groups[groupsCount].Value);
                    int captureCount = 0;
                    foreach (Capture capture in match.Groups[groupsCount].Captures)
                    {
                        Console.WriteLine("      Capture {0}: {1}", captureCount, capture.Value);
                        captureCount++;
                    }
                }
                Console.WriteLine();
            }
        }
    }
}

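The result is that each match reports two groups: Group 1 is the unnamed group (abc, then def) and Group 2 is MyName (123, then 456).

With RegexOptions.ExplicitCapture set: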

using System;
using System.Text.RegularExpressions;

namespace ConsoleApp1
{
    class Program
    {
        static void Main(string[] args)
        {
            string input = "abc-123,def-456,";
            string pattern = @"([a-z]{3})-(?<MyName>\d{3}),";

            MatchCollection matchCollection = Regex.Matches(input, pattern, RegexOptions.ExplicitCapture);

            foreach (Match match in matchCollection)
            {
                for (int groupsCount = 1; groupsCount < match.Groups.Count; groupsCount++)
                {
                    Console.WriteLine("   Group {0}:  {1}", groupsCount, match.Groups[groupsCount].Value);
                    int captureCount = 0;
                    foreach (Capture capture in match.Groups[groupsCount].Captures)
                    {
                        Console.WriteLine("      Capture {0}: {1}", captureCount, capture.Value);
                        captureCount++;
                    }
                }
                Console.WriteLine();
            }
        }
    }
}

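The result is that each match reports only one group: Group 1 is now MyName (123, then 456), and the unnamed group is no longer captured.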

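V. Compiled

In .NET, regular expressions are interpreted by default.

If the same regular expression is executed many times and performance matters, set it to compiled (RegexOptions.Compiled). Compilation costs extra time when the Regex is constructed, so it pays off only when the instance is reused.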
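One common way to apply this, sketched here under the assumption that the pattern is fixed at build time, is to construct the compiled Regex once in a static field and reuse it for every call. The class name CompiledDemo, the field name, and the pattern are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

namespace ConsoleApp1
{
    static class CompiledDemo
    {
        // Built once; the construction cost is paid a single time and then amortized.
        private static readonly Regex Numbers = new Regex(@"\d+", RegexOptions.Compiled);

        // Counts the runs of digits in the input using the shared compiled instance.
        public static int CountNumbers(string input)
        {
            return Numbers.Matches(input).Count;
        }

        static void Main(string[] args)
        {
            int total = 0;
            for (int i = 0; i < 1000; i++)
            {
                total += CountNumbers("a1 b22 c333");
            }
            Console.WriteLine(total);  // 3000
        }
    }
}
```

Whether the compiled form is actually faster for a given workload is worth measuring; for a pattern that runs only a few times, the interpreted default may be the better choice.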

VI. Singleline

using System;
using System.Text.RegularExpressions;

namespace ConsoleApp1
{
    class Program
    {
        static void Main(string[] args)
        {
            string pattern = @".";
            string str = "1\n1\u000a1";
            string result = Regex.Replace(str, pattern, "*");
            Console.WriteLine(result);
        }
    }
}

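The result is three lines, each containing a single *: the two line feeds are left untouched.

Notes:

Without RegexOptions.Singleline, "." matches any character except the newline character (\n).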

using System;
using System.Text.RegularExpressions;

namespace ConsoleApp1
{
    class Program
    {
        static void Main(string[] args)
        {
            RegexOptions regexOptions = RegexOptions.Singleline;
            string pattern = @".";
            string str = "1\n1\u000a1";
            string result = Regex.Replace(str, pattern, "*", regexOptions);
            Console.WriteLine(result);
        }
    }
}

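The result is ***** on one line: the two line feeds are replaced as well.

Notes:

With RegexOptions.Singleline set, "." matches any character, with no exception.

One more example: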

using System;
using System.Text.RegularExpressions;

namespace ConsoleApp1
{
    class Program
    {
        static void Main(string[] args)
        {
            RegexOptions regexOptions = RegexOptions.Singleline;
            string pattern = @".";
            string str = Environment.NewLine;
            string result = Regex.Replace(str, pattern, "*", regexOptions);
            Console.WriteLine(result);
        }
    }
}

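On Windows, the result is **.

Notes:

Environment.NewLine is also a line break, so with RegexOptions.Singleline set one might expect a single star. Why are there two?

The reason is that on Windows Environment.NewLine is "\r" followed by "\n", which naturally counts as two characters. (On platforms where Environment.NewLine is just "\n", the output is a single star.)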

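VII. IgnorePatternWhitespace

1. Both " " (a literal space) and "\s" stand for white space in a pattern, but once RegexOptions.IgnorePatternWhitespace is set, unescaped literal spaces are ignored and the regular expression engine recognizes only "\s".

2. Once RegexOptions.IgnorePatternWhitespace is set, everything after a "#" in the pattern, up to the end of the line, is a comment.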

using System;
using System.Text.RegularExpressions;

namespace ConsoleApp1
{
    class Program
    {
        static void Main(string[] args)
        {
            string input = "I'm comment";
            string pattern = "[\\s]#I'm comment";
            string result = Regex.Replace(input, pattern, "*", RegexOptions.IgnorePatternWhitespace);
            Console.WriteLine(result);
        }
    }
}

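The result is I'm*comment: the text after "#" is a comment, so the effective pattern is just [\s], which matches the single space in the input.

3. Even with RegexOptions.IgnorePatternWhitespace set, a space inside a character class ("[ ]") is not ignored.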

using System;
using System.Text.RegularExpressions;

namespace ConsoleApp1
{
    class Program
    {
        static void Main(string[] args)
        {
            string input = " ";
            string pattern = "[ ]";
            string result = Regex.Replace(input, pattern, "*", RegexOptions.IgnorePatternWhitespace);
            Console.WriteLine(result);
        }
    }
}

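The result is *: the space inside the brackets still matches the space in the input.

4. RegexOptions.IgnorePatternWhitespace is usually set to make a pattern more readable.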
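As a sketch of that readability benefit, the pattern below is spread over several lines with a comment on each part. The class name ReadablePatternDemo, the group names, and the date format are assumptions made for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

namespace ConsoleApp1
{
    static class ReadablePatternDemo
    {
        // Line breaks, indentation, and "#" comments are all ignored by the engine.
        private const string DatePattern = @"
            (?<year>\d{4})    # four-digit year
            -
            (?<month>\d{2})   # two-digit month
            -
            (?<day>\d{2})     # two-digit day
        ";

        // Returns the first date found, rewritten as year/month/day, or null if there is none.
        public static string FindDate(string input)
        {
            Match m = Regex.Match(input, DatePattern, RegexOptions.IgnorePatternWhitespace);
            if (!m.Success)
            {
                return null;
            }
            return string.Format("{0}/{1}/{2}",
                m.Groups["year"].Value, m.Groups["month"].Value, m.Groups["day"].Value);
        }

        static void Main(string[] args)
        {
            Console.WriteLine(FindDate("Posted on 2015-07-07"));  // 2015/07/07
        }
    }
}
```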

VIII. RightToLeft

using System;
using System.Text.RegularExpressions;

namespace ConsoleApp1
{
    class Program
    {
        static void Main(string[] args)
        {
            string input = "bot tob big ";
            string pattern = @"\bb\w+\s";
            Match m = Regex.Match(input, pattern, RegexOptions.RightToLeft);

            while (m.Success)
            {
                Console.WriteLine(m.Value);
                m = m.NextMatch();
            }
        }
    }
}

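The result is two lines, "big " followed by "bot " (each with its trailing space): the rightmost match is found first.

Notes:

With RegexOptions.RightToLeft set, the search moves through the input string from right to left.

It does not mean that the pattern itself has to be written or read element by element from right to left.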
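One practical use of this, shown here as a rough sketch, is fetching the last occurrence of something without enumerating every match. The pattern is still written left to right. The class name LastMatchDemo, the "id=" format, and the sample input are assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

namespace ConsoleApp1
{
    static class LastMatchDemo
    {
        // With RightToLeft, the first match returned is the rightmost one in the input.
        public static string LastId(string input)
        {
            Match m = Regex.Match(input, @"id=(\d+)", RegexOptions.RightToLeft);
            return m.Success ? m.Groups[1].Value : null;
        }

        static void Main(string[] args)
        {
            Console.WriteLine(LastId("id=1; id=22; id=333"));  // 333
        }
    }
}
```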

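IX. ECMAScript

RegexOptions.ECMAScript can be combined only with RegexOptions.IgnoreCase and RegexOptions.Multiline (the RegexOptions enum documentation also lists RegexOptions.Compiled).

Combining it with other options, such as RegexOptions.Singleline or RegexOptions.RightToLeft, is not allowed.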
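The restriction can be checked at run time. In the sketch below, a disallowed combination is detected by catching the ArgumentOutOfRangeException that the Regex constructor throws. The class name EcmaCombinationDemo and the helper IsValidCombination are hypothetical names introduced for this example.

```csharp
using System;
using System.Text.RegularExpressions;

namespace ConsoleApp1
{
    static class EcmaCombinationDemo
    {
        // Returns true when a Regex can be constructed with the given options.
        public static bool IsValidCombination(RegexOptions options)
        {
            try
            {
                new Regex("a", options);
                return true;
            }
            catch (ArgumentOutOfRangeException)
            {
                // Thrown by the constructor for an invalid option combination.
                return false;
            }
        }

        static void Main(string[] args)
        {
            RegexOptions allowed = RegexOptions.ECMAScript | RegexOptions.IgnoreCase | RegexOptions.Multiline;
            RegexOptions rejected = RegexOptions.ECMAScript | RegexOptions.Singleline;

            Console.WriteLine(IsValidCombination(allowed));   // True
            Console.WriteLine(IsValidCombination(rejected));  // False
        }
    }
}
```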

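ECMAScript behavior differs from canonical regular expressions in three ways.

1. Character class syntax

ECMAScript does not support Unicode properties (for example the \p category and block elements), whereas canonical regular expressions do. As a consequence, shorthand classes such as \d and \w are limited to ASCII under ECMAScript.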
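A short sketch of what this means for the shorthand classes: under canonical matching \d and \w accept Unicode digits and letters, while under ECMAScript they behave like [0-9] and [a-zA-Z_0-9]. The class name CharClassDemo and the test characters (an Arabic-Indic digit and an accented letter) are choices made for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

namespace ConsoleApp1
{
    static class CharClassDemo
    {
        // True when the whole input is a single character matched by \d.
        public static bool IsDigit(string input, RegexOptions options)
        {
            return Regex.IsMatch(input, @"^\d$", options);
        }

        // True when the whole input is a single character matched by \w.
        public static bool IsWordChar(string input, RegexOptions options)
        {
            return Regex.IsMatch(input, @"^\w$", options);
        }

        static void Main(string[] args)
        {
            string arabicIndicThree = "\u0663";  // ARABIC-INDIC DIGIT THREE
            string eAcute = "\u00e9";            // LATIN SMALL LETTER E WITH ACUTE

            Console.WriteLine(IsDigit(arabicIndicThree, RegexOptions.None));        // True
            Console.WriteLine(IsDigit(arabicIndicThree, RegexOptions.ECMAScript));  // False
            Console.WriteLine(IsWordChar(eAcute, RegexOptions.None));               // True
            Console.WriteLine(IsWordChar(eAcute, RegexOptions.ECMAScript));         // False
        }
    }
}
```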

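2. Self-referencing capturing groups

Under ECMAScript, a capturing group that contains a backreference to itself is updated on each capture iteration. In the example below this lets ((a+)(\1) ?)+ match the input under ECMAScript, while canonical matching finds no match.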

using System;
using System.Text.RegularExpressions;

namespace ConsoleApp1
{
    class Program
    {
        static string pattern;

        static void Main(string[] args)
        {
            string input = "aa aaaa aaaaaa ";
            pattern = @"((a+)(\1) ?)+";

            // Match input using canonical matching.
            AnalyzeMatch(Regex.Match(input, pattern));

            // Match input using ECMAScript.
            AnalyzeMatch(Regex.Match(input, pattern, RegexOptions.ECMAScript));
        }

        private static void AnalyzeMatch(Match m)
        {
            if (m.Success)
            {
                Console.WriteLine("'{0}' matches {1} at position {2}.",
                                  pattern, m.Value, m.Index);
                int grpCtr = 0;
                foreach (Group grp in m.Groups)
                {
                    Console.WriteLine("   {0}: '{1}'", grpCtr, grp.Value);
                    grpCtr++;
                    int capCtr = 0;
                    foreach (Capture cap in grp.Captures)
                    {
                        Console.WriteLine("      {0}: '{1}'", capCtr, cap.Value);
                        capCtr++;
                    }
                }
            }
            else
            {
                Console.WriteLine("No match found.");
            }
            Console.WriteLine();
        }

        // The example displays the following output:
        //    No match found.
        //    
        //    '((a+)(\1) ?)+' matches aa aaaa aaaaaa  at position 0.
        //       0: 'aa aaaa aaaaaa '
        //          0: 'aa aaaa aaaaaa '
        //       1: 'aaaaaa '
        //          0: 'aa '
        //          1: 'aaaa '
        //          2: 'aaaaaa '
        //       2: 'aa'
        //          0: 'aa'
        //          1: 'aa'
        //          2: 'aa'
        //       3: 'aaaa '
        //          0: ''
        //          1: 'aa '
        //          2: 'aaaa '
    }
}

3. Octal versus backreference interpretation

Regular expression	Canonical behavior	ECMAScript behavior
\0 followed by 0 to 2 octal digits	Interpret as an octal. For example, \044 is always interpreted as an octal value and means "$".	Same behavior.
\ followed by a digit from 1 to 9, followed by no additional decimal digits,	Interpret as a backreference. For example, \9 always means backreference 9, even if a ninth capturing group does not exist. If the capturing group does not exist, the regular expression parser throws an ArgumentException.	If a single decimal digit capturing group exists, backreference to that digit. Otherwise, interpret the value as a literal.
\ followed by a digit from 1 to 9, followed by additional decimal digits	Interpret the digits as a decimal value. If that capturing group exists, interpret the expression as a backreference. Otherwise, interpret the leading octal digits up to octal 377; that is, consider only the low 8 bits of the value. Interpret the remaining digits as literals. For example, in the expression \3000, if capturing group 300 exists, interpret as backreference 300; if capturing group 300 does not exist, interpret as octal 300 followed by 0.	Interpret as a backreference by converting as many digits as possible to a decimal value that can refer to a capture. If no digits can be converted, interpret as an octal by using the leading octal digits up to octal 377; interpret the remaining digits as literals.
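Two rows of the table can be checked with a small sketch: \044 is read as an octal escape for "$" in both modes, and \9 with no ninth group is rejected by the canonical parser but accepted under ECMAScript. The class name OctalDemo and the helper CanParse are hypothetical names introduced for this example.

```csharp
using System;
using System.Text.RegularExpressions;

namespace ConsoleApp1
{
    static class OctalDemo
    {
        // Returns true when the pattern can be parsed with the given options.
        public static bool CanParse(string pattern, RegexOptions options)
        {
            try
            {
                new Regex(pattern, options);
                return true;
            }
            catch (ArgumentException)
            {
                // The parser reports an invalid pattern with an ArgumentException.
                return false;
            }
        }

        static void Main(string[] args)
        {
            // \044 is octal 44, that is "$", in both modes.
            Console.WriteLine(Regex.IsMatch("$", @"\044"));                           // True
            Console.WriteLine(Regex.IsMatch("$", @"\044", RegexOptions.ECMAScript));  // True

            // \9 without a ninth capturing group.
            Console.WriteLine(CanParse(@"\9", RegexOptions.None));        // False
            Console.WriteLine(CanParse(@"\9", RegexOptions.ECMAScript));  // True
        }
    }
}
```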

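X. CultureInvariant

Languages differ from culture to culture, so the regex engine's decisions (case-insensitive comparisons in particular) can differ with the current culture.

To avoid this, set RegexOptions.CultureInvariant.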
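A minimal sketch of the usual pairing with RegexOptions.IgnoreCase follows. It switches the thread to the Turkish culture, where the casing of "i" and "I" differs from most other cultures, and then matches with the invariant culture's rules. The class name CultureInvariantDemo is made up, and the example assumes the tr-TR culture is available on the machine; if it is not, the culture switch is simply skipped.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;
using System.Threading;

namespace ConsoleApp1
{
    static class CultureInvariantDemo
    {
        // Case-insensitive match that uses invariant-culture casing rules.
        public static bool MatchesInvariant(string input, string pattern)
        {
            return Regex.IsMatch(input, pattern,
                RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
        }

        static void Main(string[] args)
        {
            try
            {
                // Assumption: tr-TR culture data is installed.
                Thread.CurrentThread.CurrentCulture = new CultureInfo("tr-TR");
            }
            catch (CultureNotFoundException)
            {
                // Culture data not available; continue with the current culture.
            }

            // With CultureInvariant, "I" and "i" are treated as the same letter
            // regardless of the current culture.
            Console.WriteLine(MatchesInvariant("FILE", "file"));  // True
        }
    }
}
```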