If you need to match letters in a multilingual application, /[a-zA-Z]/ simply won’t cut it:
iex> String.match?("å", ~r/^[a-zA-Z]$/)
false
Assuming that your regular expression library supports Unicode matching, you can match any Unicode character (code point) in the “Letter” category with \p{L}. This will match letters of all languages, cases and types covered by Unicode: a, å, B, ß, β, ӝ, ჭ, 你, etc.
This regular expression is very useful if you need to enforce the presence of letters in any language, or the absence of special characters, in a string that could be written in any of the world’s languages.
Unicode Regex in Elixir
Luckily for us Alchemists, Elixir’s Regex module supports Unicode matching, and we simply need to supply the u option to ~r (sigil_r) to release its awesome power.
Basic Latin characters from English work, as you might expect:
iex> String.match?("a", ~r/^\p{L}$/u)
true
iex> String.match?("A", ~r/^\p{L}$/u)
true
Latin character variants with umlauts and acute accents are no problem either:
iex> String.match?("ö", ~r/^\p{L}$/u)
true
iex> String.match?("Á", ~r/^\p{L}$/u)
true
Let’s make sure it’s not just returning a match for any character by trying some symbols that are not letters:
iex> String.match?("$", ~r/^\p{L}$/u)
false
iex> String.match?("@", ~r/^\p{L}$/u)
false
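Building on the single-character checks above, here is a minimal sketch of whole-string validation. The LetterCheck module and letters_only?/1 are hypothetical names made up for illustration, not part of any library. The sketch assumes the input should be NFC-normalized first: a letter written as a base letter plus a combining mark is two code points, and the combining mark falls in the mark category (\p{M}) rather than \p{L}. Normalizing composes such pairs where a precomposed form exists; pairs without one would still be rejected.

```elixir
defmodule LetterCheck do
  # Hypothetical helper, for illustration only.
  def letters_only?(string) when is_binary(string) do
    string
    # Compose base letter + combining mark into one code point where possible.
    |> String.normalize(:nfc)
    # \A and \z anchor to the real start and end of the string
    # ($ would also accept a trailing newline).
    |> String.match?(~r/\A\p{L}+\z/u)
  end
end

LetterCheck.letters_only?("Åsa")        # precomposed Å
LetterCheck.letters_only?("a\u030Asa")  # "a" followed by a combining ring
LetterCheck.letters_only?("R2D2")       # contains digits
```

The first two calls should return true and the last one false.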
Be careful to remember the u option, or the Regex will not return the expected results:
iex> String.match?("å", ~r/^\p{L}$/)
false
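The same pitfall applies to patterns built at runtime rather than with the sigil. As a small sketch, Regex.compile/2 accepts the modifier letters as a string, so "u" plays the same role as the trailing u on ~r. The variable names with_u and without_u are only illustrative.

```elixir
# Build the same pattern twice: once with the "u" modifier, once without.
{:ok, with_u} = Regex.compile("^\\p{L}$", "u")
{:ok, without_u} = Regex.compile("^\\p{L}$")

# With "u" the string is treated as UTF-8 code points.
String.match?("å", with_u)

# Without it the two bytes of "å" are matched separately,
# so a single \p{L} between ^ and $ cannot cover the whole string.
String.match?("å", without_u)
```

The first match should return true and the second false, mirroring the sigil examples.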
More cool stuff you can match with Unicode
Unicode matching gives you some pretty enormous power, since you can match on any of the Unicode general categories.
For instance, you could match only lowercase letters with \p{Ll}:
iex> String.match?("a", ~r/^\p{Ll}$/u)
true
iex> String.match?("A", ~r/^\p{Ll}$/u)
false
iex> String.match?("å", ~r/^\p{Ll}$/u)
true
iex> String.match?("Å", ~r/^\p{Ll}$/u)
false
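Categories are not limited to yes/no checks on a single character. As a rough sketch, the snippet below uses \p{Lu} (uppercase letters) with Regex.scan/2 to pull matching characters out of a longer string, and \P{L} (capital P, meaning "not a letter") with String.replace/3 to strip everything else. The sample strings and variable names are made up for illustration.

```elixir
text = "Ångström met Émile in Zürich"

# Collect every uppercase letter, whatever the script.
uppercase =
  ~r/\p{Lu}/u
  |> Regex.scan(text)
  |> List.flatten()

# Remove every character that is not a letter.
letters_only = String.replace("h3ll0, wörld!", ~r/\P{L}/u, "")
```

Here uppercase should end up as ["Å", "É", "Z"] and letters_only as "hllwörld".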
Or match only currency symbols with \p{Sc}:
iex> String.match?("$", ~r/^\p{Sc}$/u)
true
iex> String.match?("€", ~r/^\p{Sc}$/u)
true
iex> String.match?("#", ~r/^\p{Sc}$/u)
false
You could use this to determine whether an input string is a valid currency string without needing to manually list all possible currency symbols:
iex> currency_string_regex = ~r/^\p{Sc}\d+\.\d{2}$/u
~r/^\p{Sc}\d+\.\d{2}$/u
iex> ["$1.00", "£1.00", "¥1.00", "€1.00", "&1.00"] \
...> |> Enum.filter(&String.match?(&1, currency_string_regex))
["$1.00", "£1.00", "¥1.00", "€1.00"]
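If you also want the pieces of the currency string, named captures are one option. This is a minimal sketch: CurrencyParser and parse/1 are hypothetical names, the amount is kept as a string to sidestep float rounding, and the pattern uses [0-9] on the assumption that only ASCII digits should be accepted (with the u option, \d may also match decimal digits from other scripts).

```elixir
defmodule CurrencyParser do
  # Hypothetical helper, for illustration only.
  def parse(input) when is_binary(input) do
    # \A and \z make sure the whole string is a currency amount.
    regex = ~r/\A(?<symbol>\p{Sc})(?<amount>[0-9]+\.[0-9]{2})\z/u

    case Regex.named_captures(regex, input) do
      %{"symbol" => symbol, "amount" => amount} -> {:ok, symbol, amount}
      nil -> :error
    end
  end
end

CurrencyParser.parse("€12.50")
CurrencyParser.parse("&1.00")
```

The first call should return {:ok, "€", "12.50"} and the second :error.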
Got any examples of where Unicode matching is used in your applications? Please hit me up in the comments! 😃