A word boundary \b is a test, just like the anchors ^ and $.
When the regexp engine (the program module that implements searching for regexps) comes across \b, it checks that the position in the string is a word boundary.
There are three different positions that qualify as word boundaries:
- At string start, if the first string character is a word character \w.
- Between two characters in the string, where one is a word character \w and the other is not.
- At string end, if the last string character is a word character \w.
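The three rules can be restated as a single check: a position is a word boundary when exactly one of its two neighbours is a word character, with a missing neighbour (string start or end) counting as a non-word character. A minimal sketch of that check follows; isWordBoundary is a hypothetical helper written for illustration, not a built-in:

```javascript
// Hypothetical helper (not part of any library): reports whether position
// `pos` in `str` is a word boundary according to the three rules above.
function isWordBoundary(str, pos) {
  // A missing character (before the start or after the end) is not a word character.
  const isWordChar = (ch) => ch !== undefined && /\w/.test(ch);
  const before = isWordChar(str[pos - 1]); // character to the left of the position
  const after = isWordChar(str[pos]);      // character to the right of the position
  return before !== after;                 // exactly one side must be \w
}

console.log(isWordBoundary("Hello, Java!", 0));  // true: string start, "H" is \w
console.log(isWordBoundary("Hello, Java!", 5));  // true: between "o" and ","
console.log(isWordBoundary("Hello, Java!", 6));  // false: between "," and " "
console.log(isWordBoundary("Java", 4));          // true: string end, "a" is \w
```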
For instance, the regexp \bJava\b will be found in Hello, Java!, where Java is a standalone word, but not in a string such as Hello, JavaScript!, where Java is only part of a longer word.
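A short check of the standalone-word behaviour. The second string, Hello, JavaScript!, is an illustrative assumption chosen to show a case where Java is embedded in a longer word:

```javascript
// "Java" is surrounded by non-word characters (" " and "!"), so both \b tests pass.
const standaloneMatch = "Hello, Java!".match(/\bJava\b/);
console.log(standaloneMatch[0]); // "Java"

// Here "Java" is followed by "S", a word character, so the second \b fails.
const embeddedMatch = "Hello, JavaScript!".match(/\bJava\b/);
console.log(embeddedMatch); // null
```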
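In the string Hello, Java! the following positions correspond to \b: the start of the string (before H), between o and the comma, between the space and J, and between a and the exclamation mark.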
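The boundary positions in this string can also be listed programmatically. This is a minimal sketch, assuming an environment that supports String.prototype.matchAll (ES2020); the variable name is illustrative:

```javascript
// \b matches an empty string at every word boundary; matchAll reports each index.
const boundaryPositions = [..."Hello, Java!".matchAll(/\b/g)].map((m) => m.index);

// 0: before "H", 5: between "o" and ",", 7: between " " and "J", 11: between "a" and "!"
console.log(boundaryPositions); // [0, 5, 7, 11]
```

There is no boundary at the very end of the string, because the last character is the exclamation mark, which is not a word character.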
So, it matches the pattern \bHello\b, because:
- At the beginning of the string, the first test \b matches.
- Then the word Hello matches.
- Then the test \b matches again, as we’re between o and a comma.
So the pattern \bHello\b would match, but not \bHell\b (because there’s no word boundary after l) and not Java!\b (because the exclamation sign is not a word character \w, so there’s no word boundary after it).
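The three patterns discussed above can be tried against the same string. A small sketch, with illustrative variable names:

```javascript
const greeting = "Hello, Java!";

// Boundary before "H" (string start) and after "o" (followed by a comma).
const helloMatch = greeting.match(/\bHello\b/);
console.log(helloMatch[0]); // "Hello"

// After "Hell" comes "o", a word character, so the trailing \b fails.
const hellMatch = greeting.match(/\bHell\b/);
console.log(hellMatch); // null

// "!" is not a word character and nothing follows it, so there is no boundary after it.
const javaBangMatch = greeting.match(/Java!\b/);
console.log(javaBangMatch); // null
```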
We can use \b not only with words, but with digits as well.
For example, the pattern \b\d\d\b looks for standalone 2-digit numbers. In other words, it looks for 2-digit numbers that are surrounded by characters different from \w, such as spaces or punctuation (or text start/end).
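A quick illustration of the 2-digit pattern. The input strings are assumptions chosen for the example:

```javascript
// "1" is too short and "456" is too long; only "23" and "78" are standalone 2-digit numbers.
const spaceSeparated = "1 23 456 78".match(/\b\d\d\b/g);
console.log(spaceSeparated); // ["23", "78"]

// A comma is not a word character, so it delimits numbers just like a space does.
const commaSeparated = "12,34,56".match(/\b\d\d\b/g);
console.log(commaSeparated); // ["12", "34", "56"]
```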
\b doesn’t work for non-Latin alphabets
The word boundary test \b checks that there is \w on one side of the position and "not \w" on the other side.
But \w means a Latin letter a-z or A-Z (or a digit or an underscore), so the test doesn’t work for other characters, e.g. Cyrillic letters or hieroglyphs.