Java - String pattern matching by Boyer-Moore, Java String indexOf source code

最近看了一下河西朝雄先生的演算法實習應用，裡面探討到核對字串的處理，因此在這邊

就提一下一般的核對法及Boyer-Moore演算法，最後則是Java String indexOf原始碼的比對方式

，測試一下它們的執行時間。

基本上，字串的核對在這邊我們的例題是只針對英文字母的字串來處理(不是中文字orz)，

如下：

String text = "Hello! array. Learning array knowledge.";

String key = "array";

假設被核對的字串為變數text，欲核對的字串為key，接下來以三種方法來實作之。

1. 一般核對法

基本上我們要核對text字串是否有符合key的部分，最簡單的方式為先看key的長度

，這邊為5，那麼一開始從text的0 ~ 5字串位置開始核對，即

text.substring(0 , 5) => 取出"Hello"，與"array"比對結果不符合！！

再來你無法保證1 ~ 6的位置取出的字串會不符合，再來只能摸摸鼻子繼續努力！

text.substring(1 , 6) => 取出"ello!"，與"array"比對結果還是不符合！！

就這樣一路比對下去，不放過任何一個機會！

PS. 字串的index位置從0(字元陣列)開始，以endIndex - startIndex來表示要取的長度

若0 ~ 5，表示0開始，取長度5-0 = 5，以此類推！

public void stringMatchMethod1(String key, String text){
   int n = text.length()-key.length();
   int keyLen = key.length();
   long startTime = System.currentTimeMillis();
   for(int i = 0 ; i < n ; i++){
      String find = text.substring(i, i+keyLen);
      if(key.equals(find)){
         System.out.println("*** Match:"+find+" => "+text.substring(i));
      }else{
         System.out.println("Not Match:"+find);
      }
   }
   float time = (System.currentTimeMillis()-startTime) /1000.0f;
   System.out.println("執行時間:"+time+"秒");
}

從上面的source code會得到如下的結果，

Not Match:Hello
Not Match:ello!
Not Match:llo!
Not Match:lo! a
Not Match:o! ar
Not Match:! arr
Not Match: arra
*** Match:array => array. Learning array knowledge.
Not Match:rray.
Not Match:ray.
Not Match:ay. L
Not Match:y. Le
Not Match:. Lea
Not Match: Lear
Not Match:Learn
Not Match:earni
Not Match:arnin
Not Match:rning
Not Match:ning
Not Match:ing a
Not Match:ng ar
Not Match:g arr
Not Match: arra
*** Match:array => array knowledge.
Not Match:rray
Not Match:ray k
Not Match:ay kn
Not Match:y kno
Not Match: know
Not Match:knowl
Not Match:nowle
Not Match:owled
Not Match:wledg
Not Match:ledge
執行時間:0.005秒

2. Boyer-Moore 法

這個方法，一樣先從text的 0 ~ 5的字串位置取出來與key比對，不過有些許差異，

text.substring(0 , 5) => 取出"Hello"，與"array"比對結果不符合！！

但是下一次決定要去取text的字串位置就不一定是從1 ~ 6囉！

會根據當下第5個字元是啥，上面是"Hello" => 'o'，那麼表示該字元與要核對的字串

"array"是毫無任何關係，因此想都不用想地可以向後推進5個字元，以6 ~ 10的位置來比對

PS. 由於取1 ~ 6的位置還是會包含第5個位置的'o'，那麼也就不用核對囉！

text.substring(5 , 10) => 取出"! arr"，與"array"比對結果還是不符合！！

不過，"! arr"的最後一個字元為'r'，這個字元與"array"有關係，因此接下來只要推進

2個字元(為什麼？下面會提到)就有可能會得到與字串"array"相符合的字串！

text.substring(7 , 12) => 取出"array"，與"array"比對結果符合！！

再來，"array"的最後一個字元'y'，這個字元雖然與"array"有關係，不過由於它已經是

key的最後一個字元，因此視為與其他非相關的字元一樣碰到它也是要推進5個字元！

因此，採用Boyer-Moore法，核對的方式是不固定的次數，但若以第一種方法來看的話

假設text長度為39，那麼至少得比對34次(39 - 5(key length) = 34)之多！

而我們事先要知道碰到'r'只需前進2個字元，若是'h'字元需前進5個字元，因此要先建立這個

前進陣列。而為什麼'r'只要前進兩個字元呢？因為就"array"來說，若上一個是取到"! arr"，那

麼我可以猜測出下兩個字元可能為剩下的"ay"，如此只需前進2個字元。

PS. 若key字串內有兩個相同的字元，如"array"，則'a'字元的前進長度以越後面的字元為主

，即一開始'a' => 4，但後來又有個'a' => 1，那麼以較小的值為主，比較保險！

在這邊我建立的record array長度為127，record的index為字元ascii的十進位數值大小，

如'a' 換算ascii為97 => record[97] = 1。

public void createTable(String key){
   int len = key.length();
   for(int i = 0 ; i < records.length ; i++){
      records[i] = len;
   }
   for(int i = 0 ; i < len-1 ; i++){ //last char set default value
      records[key.charAt(i)] = len-i-1;
   }
}
 
public void stringMatchMethod2(String key, String text){
   createTable(key);
   int n = text.length();
   int m = key.length();
   int p = m-1; 
   long startTime = System.currentTimeMillis();
   while(p < n){
 String find = text.substring(p-m+1, p+1);
 if(key.equals(find)){
    System.out.println("*** Match:"+find+" => "+text.substring(p-m+1));
 }else{
    System.out.println("Not Match:"+find);
 }
 p = p + records[text.charAt(p)];
   }
   float time = (System.currentTimeMillis()-startTime) /1000.0f;
   System.out.println("執行時間:"+time+"秒");
}

從上面的source code會得到如下的結果，

Not Match:Hello
Not Match:! arr
*** Match:array => array. Learning array knowledge.
Not Match:. Lea
Not Match: Lear
Not Match:earni
Not Match:ng ar
Not Match: arra
*** Match:array => array knowledge.
Not Match: know
Not Match:ledge
執行時間:0.002秒

在執行時間上明顯比第一種方法快，有可能有幾種原因：

1. while(p < n){ } 並非執行34次(第一種則是)

2. 迴圈執行次數少，相對地if(key.equals(find))的equals的比對次數也會減少。請想想

當你呼叫Java equals的method，其實就是利用Java幫你去比對equals內的字元是否與

該字串本身是相等的！

equals method的片段程式碼如下：

while (n-- != 0) {
if (v1[i] != v2[i])
return false;
i++;
}
return true;

n => 字串本身的長度

v1 => 字串本身的字元陣列

v2 => 另一個字串轉換後的字元陣列

因此多呼叫equals等於在多花比對時間！

3. Java indexOf source code

我們都知道Java String class的indexOf本身就有提供字串搜尋的功能，會回傳當下搜尋字串

的第一個字元位置！

整理了一下他的source code處理的概念，套用到這個字串核對的例子，如下：

public void stringMatchMethod3(String key, String text){
   int n = text.length();
   int m = key.length();
   char first = key.charAt(0);
   int max = n - m;
   long startTime = System.currentTimeMillis();
   for(int i = 0 ; i < max ; i++){
      if(text.charAt(i) != first){
         while(++i <= max && text.charAt(i) != first);
      }
      if(i <= max){
         int j = i + 1; //second char start
         int end = j + m - 1;
         for (int k = 1 ; j < end && text.charAt(j) == key.charAt(k); k++, j++);
         if(j == end)
            System.out.println("*** Match:"+text.substring(i, end));
         else
            System.out.println("Not Match:"+text.substring(i, end));
      }
   }
   float time = (System.currentTimeMillis()-startTime) /1000.0f;
   System.out.println("執行時間:"+time+"秒");
}

上面的程式是indexOf method的實作方法(Java Source code)，步驟如下：

1. 首先取得key與text相符的第一個字元，如下

while(++i <= max && text.charAt(i) != first);

2. 接著，假設取得的第一個字元，i = 7，會進入此迴圈比對接續的字元

for (int k = 1 ; j < end && text.charAt(j) == key.charAt(k); k++, j++);

若接續的第2 ~ 5的字元有任一一個不符就跳出，若成功的一一核對完畢，

那麼表示j = end，就符合if(j == end)的條件！

從上面的source code會得到如下的結果，

*** Match:array
Not Match:ay. L
Not Match:arnin
*** Match:array
Not Match:ay kn
執行時間:0.001秒

========================結論==========================

從上面的三種字串核對的方法比較結果，發現第三種方法是最快的，再來是2 , 1

，第三種會比第二種更快有可能是，indexOf內完全沒有額外用到equals字串比對，完全

是自己本身的字元比對

當然，就空間複雜度而言，最差的首當第一種方法，就substring的使用頻率最高，

由於substring回傳回來的字串是重新new一個String object。如下：

return ((beginIndex == 0) && (endIndex == value.length)) ? this
: new String(value, beginIndex, subLen);

，除非他substring取的範圍為0 ~ 字串的長度，才回傳他自己本身XDDD

因此substring應該少用之比較好！

最後，關於執行效率與空間複雜度好壞的討論純屬個人的推論，僅供參考XDD

Zhi-Bin's 談天說地

搜尋此網誌

Java - String pattern matching by Boyer-Moore, Java String indexOf source code

留言

張貼留言