This guest post was authored by Charlie Eriksen, Bugcrowd researcher and CTO of Adversary. Adversary delivers a platform that provides technical security training and education opportunities with a focus on hands-on learning. In this post he’s sharing a fantastic write-up about a Unicode vulnerability, along with a lab for the Crowd to learn from.

At the end of November 2019, the team at Wisdom released details about a security flaw they had discovered in GitHub. In a Nov. 28, 2019 blog post they outlined how they used a case mapping collision in Unicode to get a password reset email delivered to the wrong email address. On a side note, if you want to learn far more about Unicode than you ever wished to, we recommend taking a look at Wisdom’s Unicode bible.

As a researcher, you often hear about some cool vulnerability that somebody found. Hands-on experience is a huge benefit when it comes to finding and exploiting vulnerabilities, especially in new vulnerability classes. Otherwise it’s a bit of a chicken-and-egg problem. If you haven’t exploited a vulnerability class before, it’s hard to find it. If you haven’t found it in the wild before, it’s hard to exploit it.

So, we put together a lab demonstrating this vulnerability for the whole community to play with. Given that this vulnerability class is only starting to get exposure, it’s important for researchers to have the ability to practice exploitation against vulnerabilities in a safe environment. As a co-founder of Adversary, I’m personally excited to put out content that helps educate the Crowd. Hopefully somebody will be able to put what they’ve learned to use and submit some cool findings to their favorite programs.

Enter Unicode

Unicode is not just the standard that gave us emojis everywhere. It’s the standard that defines a single character set encompassing practically every character and symbol you could imagine, so that everybody can express themselves with it, whether that means the Icelandic þ or the Turkish dotless i: ı.

Indeed, this letter really puts the dot on the “i” in terms of why it’s important to understand Unicode. Because Unicode contains so many characters that most other character sets lack, software often needs a way of converting a character to another “equivalent” character. For instance, it seems sensible that if you convert a Unicode string with a dotless “i” to ASCII, it should simply turn into an “i”, right?
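To make that concrete, here is a small Python sketch (Python is our choice for illustration here, nothing more). It only uses built-in string methods, and shows that plain case conversion already maps several non-ASCII characters onto ordinary ASCII letters:

```python
# Case-mapping collisions using only Python's built-in str methods.
dotless_i = "\u0131"    # LATIN SMALL LETTER DOTLESS I
kelvin_sign = "\u212a"  # KELVIN SIGN
long_s = "\u017f"       # LATIN SMALL LETTER LONG S

# Each of these non-ASCII characters case-maps to a plain ASCII letter.
collisions = {
    "dotless i -> upper": dotless_i.upper(),
    "kelvin sign -> lower": kelvin_sign.lower(),
    "long s -> upper": long_s.upper(),
}

for label, result in collisions.items():
    print(f"{label}: {result!r} (ASCII: {result.isascii()})")

# Two different strings become equal once both are uppercased.
same_after_upper = "Gıthub".upper() == "Github".upper()
print("Equal after upper():", same_after_upper)
```

The last comparison is the interesting one: the two inputs are different strings, yet they are indistinguishable after uppercasing.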

Where it goes wrong

To illustrate the vulnerability in question, let’s look at an example of code that would have this same vulnerability:

The logic goes something like this:

  1. It gets the email provided by the user and uppercases the email for consistency
  2. It checks if the email exists in the database
  3. If it does, then it will set a new temporary password (You shouldn’t do this. Instead, use a link with a token that allows for resetting the password)
  4. It then sends an email containing the password to the address the user provided in step 1, exactly as typed, rather than to the address stored in the database (You shouldn’t do this for multiple reasons)
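The four steps above can be sketched in Python roughly as follows. Treat this as a hedged illustration rather than the code from the original post: `USERS`, `SENT_MAIL`, `send_mail` and `reset_password_vulnerable` are hypothetical names, and an in-memory dictionary stands in for the database.

```python
import secrets

# Hypothetical user store keyed by the uppercased email (an assumption of this sketch).
USERS = {"JOHN@GITHUB.COM": {"password": "old-password"}}

# Stand-in for a real mail system: records (recipient, body) pairs.
SENT_MAIL = []


def send_mail(recipient, body):
    SENT_MAIL.append((recipient, body))


def reset_password_vulnerable(user_email):
    # Step 1: uppercase the user-supplied email "for consistency".
    normalized = user_email.upper()
    # Step 2: check whether the email exists in the database.
    user = USERS.get(normalized)
    if user is None:
        return False
    # Step 3: set a new temporary password (don't do this in real code).
    user["password"] = secrets.token_urlsafe(12)
    # Step 4: BUG - the mail goes to the raw user input, not the stored address.
    send_mail(user_email, "Your new password is " + user["password"])
    return True


# A request using the dotless i matches John's account...
print(reset_password_vulnerable("John@Gıthub.com"))
# ...but the recipient is whatever the requester typed.
print(SENT_MAIL[0][0])
```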

Let’s see what happens with the example provided in the original blog post, where a user requests a password reset for the email John@Gıthub.com (note the dotless i):

  1. It converts John@Gıthub.com to JOHN@GITHUB.COM
  2. It looks that up in the database and finds the user JOHN@GITHUB.COM
  3. It generates a new password and sends it to John@Gıthub.com

Note that the email is sent to the address the requester typed, at the look-alike domain Gıthub.com, rather than to the address actually registered to the account. Oops!

How to fix this

The interesting aspect of this specific vulnerability is that there are multiple factors that make it vulnerable:

  1. There’s the actual Unicode case-mapping behaviour
  2. There’s the logic determining which email address to use (the user-provided email address, versus the one in the database)

In theory, you can fix this specific issue in two ways as identified in the blog post from Wisdom:

  1. Convert the email to ASCII using Punycode conversion
  2. Use the email address from the database rather than the one provided by the user
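Here is a minimal sketch of what applying both fixes together might look like in Python. It rests on several assumptions that are ours rather than Wisdom’s: the names `USERS`, `SENT_MAIL`, `send_mail`, `to_ascii_email` and `request_password_reset` are hypothetical, accounts are keyed by a lowercased ASCII email, and Python’s built-in `idna` codec does the Punycode conversion of the domain. It also swaps the temporary password for a reset token.

```python
import secrets

# Hypothetical user store keyed by the lowercased ASCII email.
USERS = {"john@github.com": {"email": "john@github.com", "reset_token": None}}

# Stand-in for a real mail system: records (recipient, body) pairs.
SENT_MAIL = []


def send_mail(recipient, body):
    SENT_MAIL.append((recipient, body))


def to_ascii_email(user_email):
    """Return an ASCII-only lookup key, or None if the email can't be converted."""
    local, sep, domain = user_email.rpartition("@")
    # This sketch simply rejects non-ASCII local parts.
    if not sep or not local.isascii():
        return None
    try:
        # Non-ASCII domain labels become Punycode ("xn--...") labels.
        ascii_domain = domain.encode("idna").decode("ascii")
    except UnicodeError:
        return None
    # Lowercasing is safe here because the string is already pure ASCII.
    return (local + "@" + ascii_domain).lower()


def request_password_reset(user_email):
    # Fix 1: convert to ASCII before any case mapping or lookup.
    key = to_ascii_email(user_email)
    user = USERS.get(key) if key else None
    if user is None:
        return False
    # Issue a reset token instead of mailing out a new password.
    user["reset_token"] = secrets.token_urlsafe(32)
    # Fix 2: send to the address stored in the database, never the raw input.
    send_mail(user["email"], "Reset token: " + user["reset_token"])
    return True


print(request_password_reset("John@Gıthub.com"))  # look-alike domain
print(request_password_reset("John@GitHub.com"))  # genuine address
print([recipient for recipient, _ in SENT_MAIL])
```

With the first fix in place the look-alike domain no longer matches John’s account at all, and even if some other encoding trick slipped past it, the second fix means the mail would still only go to the address on file.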

When it comes to hardening software, you want to have as many layers of defense in place as possible. For all we know, there may be other ways to exploit encoding that we are not yet aware of. Thus, it’s generally preferable to address both aspects and harden the code as much as possible. This greatly decreases risk.

What the community has done

The bug bounty community has been aware of this type of vulnerability for a while now. Tomnomnom low-key released a tool two years ago, unisub, which allows you to find Unicode characters that might get converted into your chosen character.
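To get a feel for that kind of search, here is a rough Python sketch of the idea. To be clear, this is not unisub and makes no claim about how unisub works; `case_collisions` is a hypothetical helper that brute-forces every code point and only considers Python’s own upper- and lowercase mappings.

```python
import sys


def case_collisions(target):
    """Return characters other than `target` whose upper- or lowercase form equals it."""
    found = []
    for code_point in range(sys.maxunicode + 1):
        ch = chr(code_point)
        if ch != target and target in (ch.upper(), ch.lower()):
            found.append(ch)
    return found


# Search for characters that collapse into a few ASCII letters.
results = {target: case_collisions(target) for target in ("I", "k", "S")}

for target, hits in results.items():
    # Print code points rather than raw characters so the output is unambiguous.
    print(target, [f"U+{ord(ch):04X}" for ch in hits])
```

Each result naturally includes the ordinary ASCII counterpart (for example “i” for “I”); the non-ASCII entries are the ones worth trying as payloads.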

0xSha released a case study of exploiting Django with Unicode case transformations. He shows both the vulnerable code in the case of Django and how they addressed the issue. This goes to show that this type of issue is by no means well-explored in popular software. So understanding this type of vulnerability is important, since we will likely see a lot more vulnerabilities relating to character encoding. It’s especially worth noting that 0xSha shows that the actual behaviour differs between programming languages, so stay vigilant!