After reading this blog post on a bug in GitHub and Unicode, I started playing more and more with Unicode (I even bought two domains).
Recently, I had a eureka moment while camping and started wondering: “what is the impact of those uppercase and lowercase transformations on regular expressions?”
First, let’s say your website wants to ensure that a URL provided by a user is part of a list of trusted URLs (to avoid SSRF or as part of a CORS policy). Your website can use a list of predefined URLs, but this quickly gets tedious. So after a while, you decide to move to a regular expression that checks that the host in the URL ends with your domain. Your code looks something like this:
host =~ /domain.tld$/
For whatever reason, you decide to add the i or re.IGNORECASE flag to make sure both domain.tld and DOMAIN.TLD will work (and even DoMaIn.Tld). Your regular expression ends up looking like:
host =~ /domain.tld$/i
This could also be used if you want to ensure an email address is part of your domain.
email =~ /domain.tld$/i
Now consider the domain domaın.tld, which contains a LATIN SMALL LETTER DOTLESS I (U+0131) in place of the i. Will it match the case-insensitive regular expression?
The answer depends on the programming language used (and the version).
In Python 3.8.1, domaın.tld will match 'domain.tld$', re.IGNORECASE. ſ will match s and K (Kelvin sign) will match k.
In Ruby 2.7.0, domaın.tld will NOT match /domain.tld$/i. However, ſ will match s and K (Kelvin sign) will match k.
In Golang 1.13.8, domaın.tld will NOT match '(?i)domain.tld$'. However, ſ will match s and K (Kelvin sign) will match k.
In node 13.8.0, domaın.tld will NOT match /domain.tld$/i, ſ will not match s and K (Kelvin sign) will not match k.
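The Python result above can be reproduced with a short script. This is only a sketch: the helper name `matches` is made up for illustration, and the pattern is the one from this post, left unchanged. It also shows that combining re.IGNORECASE with re.ASCII, which the Python documentation describes as disabling non-ASCII case-insensitive matches, makes the look-alike host stop matching.

```python
import re

# Pattern from the post, left as-is.
PATTERN = r"domain.tld$"


def matches(host: str, flags: int) -> bool:
    """Hypothetical helper: True if the host matches PATTERN with the given flags."""
    return re.search(PATTERN, host, flags) is not None


# "domaın.tld" written with an escape so the dotless i (U+0131) is explicit.
lookalike = "doma\u0131n.tld"

# Unicode-aware case-insensitive matching (the default for str patterns).
unicode_result = matches(lookalike, re.IGNORECASE)

# Same check, restricted to ASCII-only case folding.
ascii_result = matches(lookalike, re.IGNORECASE | re.ASCII)

# The other two characters mentioned above: long s (U+017F) and Kelvin sign (U+212A).
long_s_result = re.search("s", "\u017f", re.IGNORECASE) is not None
kelvin_result = re.search("k", "\u212a", re.IGNORECASE) is not None

print(unicode_result, ascii_result, long_s_result, kelvin_result)
```

Plain ASCII variants such as DOMAIN.TLD still match when re.ASCII is added, so the flag only removes the Unicode equivalences.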
My advice: if you are a developer, avoid the i or IGNORECASE flag when you can; if you are a pentester or bounty hunter, make sure you test for this!
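For developers, one way to avoid the flag altogether is to drop the regular expression and compare strings directly. The sketch below is an assumption on my part rather than a drop-in fix: `is_trusted_host` and `TRUSTED_DOMAIN` are hypothetical names, and the check is deliberately stricter than the original pattern, since it only accepts the domain itself or a subdomain of it.

```python
# Placeholder domain, taken from the examples in this post.
TRUSTED_DOMAIN = "domain.tld"


def is_trusted_host(host: str) -> bool:
    """Hypothetical check: accept TRUSTED_DOMAIN or one of its subdomains."""
    # Reject non-ASCII input first, so look-alike characters such as
    # U+0131 never reach the comparison.
    if not host.isascii():
        return False
    # Lowercasing is unambiguous once the string is known to be ASCII.
    host = host.lower()
    # Require an exact match or a match on a label boundary (".domain.tld").
    return host == TRUSTED_DOMAIN or host.endswith("." + TRUSTED_DOMAIN)


print(is_trusted_host("DoMaIn.Tld"))
print(is_trusted_host("doma\u0131n.tld"))
```

Rejecting non-ASCII hosts means internationalized domain names would have to be passed in their Punycode form, which may or may not suit your application.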