If you reviewed a patch like this, would you be concerned about the security vulnerability introduced?
function sanitizeUsername(username) {
// Only allow DNS-type names
disallowed = /[^-a-z0-9]/;
usernаme = username.replace(disallowed, '-');
return username;
}
If you didn’t spot the vulnerability, would you be surprised to learn that this function does not remove any characters from usernames? What you don’t see is that the а in the assignment to usernаme is the Unicode character U+0430, also known as CYRILLIC SMALL LETTER A.
This means that the replacement assigns to an otherwise-unused variable, and then the input is returned untouched!
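One way to make such a character visible is to print the code point of anything outside the ASCII range. The following is a minimal sketch, not part of Minder; the helper name findNonAscii is hypothetical, and the suspicious line is written with a \u0430 escape so the Cyrillic letter is unambiguous.

```javascript
// Hypothetical helper: list every non-ASCII character in a string
// together with its position and Unicode code point.
function findNonAscii(text) {
  const findings = [];
  let index = 0;
  for (const ch of text) { // iterates by code point, not by UTF-16 unit
    const codePoint = ch.codePointAt(0);
    if (codePoint > 0x7f) {
      findings.push({
        index,
        codePoint: "U+" + codePoint.toString(16).toUpperCase().padStart(4, "0"),
      });
    }
    index += ch.length;
  }
  return findings;
}

// The assignment from the patch, with the Cyrillic letter spelled as an escape.
const suspicious = "usern\u0430me = username.replace(disallowed, '-');";
console.log(findNonAscii(suspicious));
// → [ { index: 5, codePoint: 'U+0430' } ]

// The all-Latin spelling produces no findings.
console.log(findNonAscii("username = username.replace(disallowed, '-');"));
// → []
```

A check like this is deliberately blunt: it flags every non-ASCII character, including legitimate ones, which is why the rules described later focus on invisible characters and mixed scripts instead.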
Homoglyphs are characters that appear visually similar or identical to each other but are represented by different Unicode code points, for example CYRILLIC SMALL LETTER A (U+0430) and LATIN SMALL LETTER A (U+0061). Many programming languages support internationalized characters for variable names to allow programmers from all over the world to express themselves using their native language. Unfortunately, malicious actors can attack projects using these homoglyphs:
By introducing invisible characters into lines of code
By mixing scripts in code, like using a visually similar character from a different script to camouflage malicious intent
By including bidirectional Unicode text, which can be compiled differently than it appears in a code review (see the vulnerability CVE-2021-42574)
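The bidirectional case in the list above can be illustrated with a short sketch that searches a line for Unicode bidirectional formatting characters. This is an assumption-laden example rather than anything Minder ships: the function name findBidiControls is hypothetical, and the character class covers only the embedding, override, and isolate controls (U+202A to U+202E and U+2066 to U+2069).

```javascript
// Bidirectional formatting characters: embeddings and overrides
// (U+202A-U+202E) and isolates (U+2066-U+2069).
const BIDI_CONTROLS = /[\u202A-\u202E\u2066-\u2069]/g;

// Hypothetical helper: report each bidi control found in a line.
function findBidiControls(line) {
  return [...line.matchAll(BIDI_CONTROLS)].map((match) => ({
    index: match.index,
    codePoint: "U+" + match[0].codePointAt(0).toString(16).toUpperCase(),
  }));
}

// A line in the style of the Trojan Source examples, with the controls
// written as escapes so they are visible in this listing.
const trojanLine =
  'if (accessLevel != "user\u202E \u2066// Check if admin\u2069 \u2066") {';

console.log(findBidiControls(trojanLine).map((f) => f.codePoint));
// → [ 'U+202E', 'U+2066', 'U+2069', 'U+2066' ]
```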
Organizations like GitHub, Red Hat, and Rust have all taken steps to proactively mitigate certain homoglyph attacks, but other vulnerabilities remain unaddressed, presenting opportunities for crafty attackers. Among them is the attack presented at the beginning of this post, tracked as CVE-2021-42694. While we don’t want to undo the progress made over the last 20 years in making programming accessible to people of all backgrounds, the Minder team aims to implement thoughtful protections that target these attacks without limiting everyone to US-ASCII.
Minder is introducing new rule types that analyze your pull requests for homoglyph attacks, providing a proactive defense against these threats. The rules detect invisible characters and mixed scripts, so they can catch homoglyph attacks before they’re introduced into your codebase. If you are interested in the complexities of such vulnerabilities, we recommend the academic paper Trojan Source: Invisible Vulnerabilities, published by a group of researchers at the University of Cambridge.
Below, we’ll explore why invisible characters and mixed scripts are threats, and how Minder can help.
Invisible characters, while seemingly harmless, are a sophisticated attack vector in software development. Because they render as nothing at all, they can change how code executes without changing how it looks, opening a gap between expected and actual behavior. Vulnerabilities introduced this way can evade even thorough code reviews.
Consider the JavaScript function getUserAccessLevel, which seemingly assigns a "basic" access level to every user. However, embedding an invisible character within the return value allows an adversary to subvert logical comparisons, as shown in the function assignAccessLevel. This manipulation ensures that the returned access level string is never strictly equal to "basic", so every user falls through to the admin branch without raising suspicion.
(Tip: You can use a Unicode decoder tool, such as the one from BabelStone, to identify invisible characters.)
function getUserAccessLevel(username) {
// Returns "basic" access with an "invisible plus" character (U+2064) appended.
return "basic";
}
function assignAccessLevel(access_level) {
if (access_level === "basic") {
console.log("You are not an admin.");
} else {
console.log("You are an admin.");
}
}
let username = "user";
let access_level = getUserAccessLevel(username);
assignAccessLevel(access_level);
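Because the invisible character cannot be seen in the listing above, it may help to spell it out. The following sketch writes the INVISIBLE PLUS as the escape \u2064 (the variable name returned is only for illustration) and shows why the strict comparison fails.

```javascript
// "basic" followed by INVISIBLE PLUS (U+2064), written as an escape
// so that the extra character is visible in the source.
const returned = "basic\u2064";

console.log(returned === "basic"); // → false
console.log(returned.length);      // → 6

// Dumping the code points exposes the trailing character.
console.log([...returned].map((ch) => ch.codePointAt(0).toString(16)));
// → [ '62', '61', '73', '69', '63', '2064' ]
```

Printed to a terminal or rendered in a diff, the string still looks like "basic"; only its length and code points give it away.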
Homoglyph attacks make deceptive use of mixed scripts, utilizing characters that are visually similar across different scripts to camouflage malicious intent within code. These attacks can craft functions or variables that appear legitimate, but redirect program execution toward shady ends, a concern elaborated in both the Trojan Source paper above and in the Unicode Consortium's report on Mixed Script Detection.
A seemingly harmless Node.js function, secureHashPassword, might be compromised by the introduction of a homoglyph in a variable name. The Cyrillic 'а' (U+0430) replaces the Latin 'a' (U+0061), so the function executes without error but fails to return the expected hashed password. Instead, the hash is assigned to a different variable and the original plaintext password is returned, where it might be stored or leaked, demonstrating a clear and present danger in the handling of sensitive information.
const crypto = require('crypto');
function secureHashPassword(password) {
const salt = crypto.randomBytes(16).toString('hex');
// The variable below has a Cyrillic 'а', making it a homoglyph.
pаssword = crypto.pbkdf2Sync(password, salt, 1000, 64, `sha512`).toString(`hex`);
// Returns the original password instead of the hashed one.
return password;
}
let userPassword = "mySecurePassword";
let hashedPassword = secureHashPassword(userPassword);
console.log(`Original password: ${userPassword}`);
console.log(`Hashed password: ${hashedPassword}`);
Minder has added two new rule types designed to automatically detect these attacks in your code, and comment on PRs when they are detected. These rules are integrated into a dedicated profile aimed at defending against homoglyph attacks, but you can also incorporate them into your own profiles:
version: v1
type: profile
name: homoglyphs-github-profile
context:
  provider: github
alert: "off"
remediate: "off"
pull_request:
  - type: invisible_characters_check
    params: {}
    def: {}
  - type: mixed_scripts_check
    params: {}
    def: {}
Our first rule tackles invisible characters, those elusive elements that attempt to infiltrate your code undetected.
Let’s assume a pull request adds a potentially dangerous line containing an invisible character, and we’ve activated the “homoglyphs” profile to guard our repo. The rule type looks as follows:
version: v1
type: rule-type
name: invisible_characters_check
context:
  provider: github
description: |
  For every pull request submitted to a repository, this rule will
  check if the pull request adds a new change patch with invisible characters.
  If it does, the rule will fail and the pull request will be commented on.
guidance: |
  Detects and highlights the use of invisible characters
  that could potentially hide malicious code.
  The characters classified as "invisible" can be found at
  https://invisible-characters.com/
  For more information on the potential security implications, see
  https://www.usenix.org/system/files/usenixsecurity23-boucher.pdf
def:
  in_entity: pull_request
  param_schema:
    properties: {}
  rule_schema: {}
  ingest:
    type: diff
    diff:
      type: full
  eval:
    type: homoglyphs
    homoglyphs:
      type: invisible_characters
Minder's approach is pretty straightforward: it highlights the potentially malicious segments of the code and comments directly on the implicated line.
To detect these characters, we employ a simple in-memory database of code points classified as invisible, and flag any content in the change patch that matches it. For full transparency, we source the Unicode code points for 'invisible characters' from https://invisible-characters.com.
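The lookup idea can be sketched in a few lines of JavaScript. This is not Minder's implementation: the function name scanPatch is hypothetical, the INVISIBLE set is a small illustrative subset rather than the full list from invisible-characters.com, and the patch handling assumes a plain unified diff.

```javascript
// Illustrative subset of invisible code points (assumption, not Minder's list):
// zero-width space/joiners, word joiner, invisible operators, and the BOM.
const INVISIBLE = new Set([
  0x200b, 0x200c, 0x200d, 0x2060, 0x2061, 0x2062, 0x2063, 0x2064, 0xfeff,
]);

// Hypothetical helper: scan only the added lines of a unified diff.
function scanPatch(patch) {
  const findings = [];
  patch.split("\n").forEach((line, i) => {
    // Keep added lines; skip the "+++ b/file" header. (Simplification:
    // an added line whose content begins with "++" is also skipped.)
    if (!line.startsWith("+") || line.startsWith("+++")) return;
    for (const ch of line.slice(1)) {
      const codePoint = ch.codePointAt(0);
      if (INVISIBLE.has(codePoint)) {
        findings.push({ patchLine: i + 1, codePoint: codePoint.toString(16) });
      }
    }
  });
  return findings;
}

const patch = [
  "--- a/access.js",
  "+++ b/access.js",
  "@@ -1,3 +1,3 @@",
  " function getUserAccessLevel(username) {",
  '-  return "basic";',
  '+  return "basic\u2064";',
  " }",
].join("\n");

console.log(scanPatch(patch));
// → [ { patchLine: 6, codePoint: '2064' } ]
```

A Set gives constant-time membership checks, so the cost of scanning grows only with the size of the patch.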
Next, let's delve into our second rule, pinpointing characters that look similar yet differ in their underlying code.
version: v1
type: rule-type
name: mixed_scripts_check
context:
  provider: github
description: |
  For every pull request submitted to a repository, this rule will
  check if the pull request adds a new change patch
  that contains mixed scripts.
  If it does, the rule will fail and the pull request will be commented on.
guidance: |
  Detects and highlights the use of strings with mixed scripts
  that could potentially hide malicious code.
  For more information, see
  https://unicode.org/reports/tr39/#Mixed_Script_Detection
  and
  https://www.usenix.org/system/files/usenixsecurity23-boucher.pdf
def:
  in_entity: pull_request
  param_schema:
    properties: {}
  rule_schema: {}
  ingest:
    type: diff
    diff:
      type: full
  eval:
    type: homoglyphs
    homoglyphs:
      type: mixed_scripts
As with the previous rule, Minder annotates lines containing potentially harmful strings with mixed scripts, naming the specific strings and scripts involved. Note that Minder disregards the 'Common' script, which covers characters such as digits and punctuation that are shared across scripts, so its presence never makes a string count as mixed.
For this functionality, we again use a simple in-memory database that maps each Unicode character to its script category. The database is derived from the Unicode Character Database, and any characters absent from the source data are generated explicitly to ensure comprehensive coverage.
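The same idea can be approximated in JavaScript with Unicode property escapes instead of a generated database. The sketch below is an assumption-based illustration, not Minder's code: the names scriptOf and mixedScriptWords are hypothetical, and only a handful of scripts are checked. Characters that match none of them, including everything in the Common script, are ignored.

```javascript
// Small sample of scripts to test for (assumption: a real table covers all scripts).
const SCRIPTS = ["Latin", "Cyrillic", "Greek", "Armenian", "Hebrew", "Arabic", "Han"];
const SCRIPT_TESTS = SCRIPTS.map((name) => [
  name,
  new RegExp(`^\\p{Script=${name}}$`, "u"),
]);

// Returns the script name of a single character, or null for Common,
// Inherited, or any script outside the sample list.
function scriptOf(ch) {
  const hit = SCRIPT_TESTS.find(([, re]) => re.test(ch));
  return hit ? hit[0] : null;
}

// Hypothetical helper: report words whose characters span more than one script.
function mixedScriptWords(line) {
  const findings = [];
  // Split on anything that is not a letter, mark, number, or underscore.
  for (const word of line.split(/[^\p{L}\p{M}\p{N}_]+/u)) {
    const scripts = new Set();
    for (const ch of word) {
      const script = scriptOf(ch);
      if (script) scripts.add(script);
    }
    if (scripts.size > 1) findings.push({ word, scripts: [...scripts] });
  }
  return findings;
}

// The assignment from secureHashPassword, with the Cyrillic letter as an escape.
const mixed = mixedScriptWords("p\u0430ssword = hash(password);");
console.log(mixed.map((f) => f.scripts));
// → [ [ 'Latin', 'Cyrillic' ] ]
```

Checking per word rather than per line keeps legitimate code, such as a Latin identifier next to a Cyrillic string literal, from being flagged merely because both appear on the same line.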
By enabling the "homoglyphs" profile, you effectively enlist Minder's vigilant duo to safeguard your code. This feature ensures continuous protection against these sophisticated threats, maintaining the integrity of your GitHub repositories. 🎉🛡️
Teodor Yanev
Software Engineer
Teodor is a software engineer at Stacklok. He is based in Bulgaria.