How to Regex: A Practical Guide to Regular Expressions (Regex) for Hackers

Regular expressions (a.k.a. regex or regexp) have a fairly steep learning curve, but once you dedicate an hour or so to learning the basics, you will be far more efficient at everyday tasks. By the time you finish reading this post, you will hopefully have a practical understanding of:

  • Regex fundamentals
  • How to use regex in a practical sense
  • How to bypass regex-based security controls

Let’s go!

What is a Regex?

A regex is a string of characters that defines a search pattern. The most basic example would be a straight string match, for example “abc”.

| Regex String | Matches | Doesn’t Match |
| --- | --- | --- |
| abc | abc | bbc |

Of course, if regex was only capable of doing straight string matches, it wouldn’t be very useful! Here’s another example:

| Regex String | Matches | Doesn’t Match |
| --- | --- | --- |
| .bc | abc, bbc | abb, acc |

You may have suspected this already, but the “.” (dot) character is a wildcard: it matches any single character, so the search string will still match whatever character appears in that position. Now regex is looking slightly more useful, but we haven’t even scratched the surface yet!
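As a quick check of the dot wildcard, here is a minimal Python sketch (Python is my choice for illustration, the article itself is tool-agnostic). It uses `re.fullmatch`, so each candidate has to match the pattern as a whole string.

```python
import re

# ".bc" means: any single character, then a literal "b", then a literal "c".
pattern = re.compile(r".bc")

candidates = ["abc", "bbc", "abb", "acc"]

# fullmatch() only succeeds if the entire string fits the pattern.
results = {word: bool(pattern.fullmatch(word)) for word in candidates}
print(results)
```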

When is Regex Useful?

Oftentimes when hacking, you will find yourself in a situation where you need to parse or edit some text in an automated fashion.

As an example, let’s use a list of URLs as our input.

https://example.com/test
https://www.bugcrowd.com/?param=value
http://hakluke.com
ftp://EXAMPLE.org?test=success

We are using a tool that requires domain names, not URLs, so we need to somehow extract the domain names from these URLs, but how? As you can see, the URLs in this list are quite varied: there are three different schemes (http, https and ftp), some have directories and some do not, some have parameters and others do not.

We could open it up in a text editor and manually remove them, but what if there are 1 million URLs? Do we waste a day doing this? Train monkeys? No, we don’t train monkeys. We are civilized humans. Civilized humans use regex!

There are many different ways to do this, but here’s the regex string I came up with:

(?<=:\/\/)(?i)[a-z0-9.-]*

Run against the list above, it matches example.com, www.bugcrowd.com, hakluke.com and EXAMPLE.org.

If you’re new to regex, the example above will probably provide more questions than answers – but hopefully it will give you an idea of the power of regular expressions. Read on to understand what it all means!
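For readers who want to try the domain extraction right away, here is a minimal Python sketch. It is an adaptation of the pattern above rather than a copy: the character class is `[a-z0-9.-]`, and case-insensitivity is passed as the `re.IGNORECASE` flag because recent Python versions reject an inline `(?i)` that is not at the very start of the pattern. The name `extract_domain` is hypothetical.

```python
import re

urls = [
    "https://example.com/test",
    "https://www.bugcrowd.com/?param=value",
    "http://hakluke.com",
    "ftp://EXAMPLE.org?test=success",
]

# Lookbehind for "://", then one or more letters, digits, dots or hyphens.
DOMAIN_RE = re.compile(r"(?<=://)[a-z0-9.-]+", re.IGNORECASE)

def extract_domain(url):
    """Return the host part of a URL, or None if nothing matched."""
    match = DOMAIN_RE.search(url)
    return match.group(0) if match else None

domains = [extract_domain(url) for url in urls]
print(domains)
```

The match stops at the first character outside the class, which is why paths and query strings are left out.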

The Tools

Before we dive into regular expressions further, it’s important to mention a few of the basic tools that utilise regular expressions, and their common use cases.

  • sed allows you to find and replace text. You provide a regular expression to find, and any match is replaced with the text you provide.
  • grep allows you to filter by regular expressions. The input could be a large chunk of text, and the output will only contain the filtered results. It’s also good to know that pcregrep exists, which uses the feature-rich Perl variant of regex to parse the input. If you’re using GNU grep, you should be able to use grep -P instead, which should have the same result.
  • Regex101.com is a website that allows you to test out your regular expressions.
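If you would rather experiment from a script than from the shell, the two command-line ideas above (filtering like grep, replacing like sed) can be roughly approximated in Python. This is only a sketch of the concepts, not a reimplementation of either tool, and the sample lines are made up for illustration.

```python
import re

lines = ["http://hakluke.com", "not a url", "https://example.com/test"]

# grep-like: keep only the lines in which the pattern is found.
url_lines = [line for line in lines if re.search(r"^http", line)]

# sed-like: replace every match of the pattern with the given text.
upgraded = [re.sub(r"^http:", "https:", line) for line in url_lines]

print(url_lines)
print(upgraded)
```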

The Basics

| Regular Expression | Meaning |
| --- | --- |
| a | The literal character “a” |
| A | The literal character “A” |
| . | Any single character |
| [a-z] | Any lowercase alphabet character |
| [A-Z] | Any uppercase alphabet character |
| [A-F] | One of the following: A, B, C, D, E or F |
| [0-9] | Any single digit number |
| [a-fA-F0-9] | Any hexadecimal character |
| ^ | The start of a line |
| $ | The end of a line |
| * | Match zero or more of the preceding expression |
| + | Match one or more of the preceding expression |

Even just knowing the expressions above, we can construct some very useful queries. For example, let’s say we are parsing a file with the following contents:

Monday
Green
Tuesday
Weather
Wednesday
Thursday
Friday
Saturday
Sunday
Pinecone

If we wanted to extract the words ending with “day”, we could use the following regex:

.*day$

Let’s go through what this actually does:

| Expression | Meaning |
| --- | --- |
| . | Any single character |
| * | Zero or more |
| day | Literally “day” |
| $ | The end of a line |

When we put it all together we get “Zero or more characters followed by ‘day’ followed by the end of a line”. Run against the file above in regex101.com, it matches the seven lines that end in “day” and skips Green, Weather and Pinecone.

Looking good! But if we are ingesting more dynamic data, there’s a good chance this will break. One of the issues with this method is that it matches any line ending with “day”, not just single words. For example, a line such as “What a lovely day” would be matched in its entirety.

Of course, this is not ideal, because we only want to extract weekdays.
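Both observations can be reproduced with a short Python sketch. It assumes multiline mode (`re.MULTILINE`), so that `$` matches at the end of every line rather than only at the end of the text. The sentence in the second call is a made-up example.

```python
import re

text = """Monday
Green
Tuesday
Weather
Wednesday
Thursday
Friday
Saturday
Sunday
Pinecone"""

# Zero or more characters, then "day", then the end of a line.
matches = re.findall(r".*day$", text, flags=re.MULTILINE)
print(matches)

# The weakness: the whole line is matched, not just a single word.
whole_line = re.findall(r".*day$", "What a lovely day", flags=re.MULTILINE)
print(whole_line)
```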

Case Sensitivity

If we wanted to be more specific and not include the rest of the line, we could try something like this:

[a-z]*day$

Except something annoying will happen when a line starts with “Saturday” and ends with the word “day”.

There are two problems here:

  • The first letter of each line is not selected (because it is uppercase, and therefore does not satisfy [a-z])
  • The “day” at the end of the line is selected over the “Saturday” at the start of the line

We can solve the first issue by switching off case sensitivity by preceding the expression with (?i). For example:

(?i)[a-z]*day$

We could also have explicitly included uppercase letters in the search, which would have the same result:

[a-zA-Z]*day$

We still have the second problem though, where “day” is chosen over “Saturday”. You might think that removing the $ at the end of our regex would fix this problem, but it doesn’t quite work, because now both “Saturday” and “day” are selected.
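The behaviour described in this section can be checked with the following Python sketch. The sample line is an assumption (any line that starts with “Saturday” and ends with “day” would do), and case-insensitivity is passed as a flag instead of an inline (?i).

```python
import re

line = "Saturday is my favourite day"

# Case-sensitive [a-z] cannot match the uppercase first letter.
lowercase_only = re.findall(r"[a-z]*day$", "Monday")

# Ignoring case does not help here: "$" forces the match to the end of the line.
end_anchored = re.findall(r"[a-z]*day$", line, flags=re.IGNORECASE)

# Dropping "$" selects "Saturday", but the standalone "day" as well.
unanchored = re.findall(r"[a-z]*day", line, flags=re.IGNORECASE)

print(lowercase_only, end_anchored, unanchored)
```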

Quantifiers

One way to solve this problem would be to use a different quantifier. Currently the quantifier we are using is *. Here is a list of quantifiers:

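| Quantifier | Meaning | Example | Matches | Doesn’t match |
| --- | --- | --- | --- | --- |
| * | Zero or more | a*b | b, ab, aab | bb, abb |
| + | One or more | a+b | ab, aab | b, abb |
| ? | Once or not at all | a?b | b, ab | aab, bb |
| {5} | Exactly 5 times | a{5}b | aaaaab | ab, aaaaaab |
| {3,6} | 3 to 6 times | a{3,6}b | aaab, aaaaab | aab, aaaaaaab |
| {3,} | 3 or more times | a{3,}b | aaab, aaaaaaab | aab, bb |

The last two columns apply when the example has to match the whole string. A search for a*b inside “abb”, for instance, would still find “ab”.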
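One way to convince yourself that the quantifier table holds is to test every row. The Python sketch below does that with `re.fullmatch`, so each example must match the entire string. The helper name `row_holds` is hypothetical.

```python
import re

# pattern -> (strings that should match in full, strings that should not)
QUANTIFIER_EXAMPLES = {
    r"a*b": (["b", "ab", "aab"], ["bb", "abb"]),
    r"a+b": (["ab", "aab"], ["b", "abb"]),
    r"a?b": (["b", "ab"], ["aab", "bb"]),
    r"a{5}b": (["aaaaab"], ["ab", "aaaaaab"]),
    r"a{3,6}b": (["aaab", "aaaaab"], ["aab", "aaaaaaab"]),
    r"a{3,}b": (["aaab", "aaaaaaab"], ["aab", "bb"]),
}

def row_holds(pattern, should_match, should_not_match):
    """True if every positive example matches and every negative one does not."""
    matched = all(re.fullmatch(pattern, s) for s in should_match)
    rejected = not any(re.fullmatch(pattern, s) for s in should_not_match)
    return matched and rejected

table_holds = {
    pattern: row_holds(pattern, yes, no)
    for pattern, (yes, no) in QUANTIFIER_EXAMPLES.items()
}
print(table_holds)

# A substring search is more permissive than a whole-string match.
found_inside = bool(re.search(r"a*b", "abb"))
```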

So we could solve this issue by using the “+” quantifier instead of the “*” quantifier. This would mean that “day” would not be matched, because it does not have an alphabetical character preceding it. Remember that + is the same as * except that + requires one or more instances of the preceding expression, while * requires zero or more. Let’s try out the + with (?i)[a-z]+day.

Great! The word “day” is no longer matched by itself. Now we have another issue though. Words like “Happyday” or “Holiday” would still be matched.
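Here is a small Python sketch of the “+” quantifier on an assumed sample line, followed by the remaining weakness. Case-insensitivity is passed as a flag.

```python
import re

line = "Saturday is my favourite day"

# At least one letter must come before "day", so a bare "day" no longer matches.
with_plus = re.findall(r"[a-z]+day", line, flags=re.IGNORECASE)

# The remaining weakness: any word that ends in "day" is accepted.
false_positives = re.findall(r"[a-z]+day", "Happyday Holiday", flags=re.IGNORECASE)

print(with_plus, false_positives)
```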

The OR operator

We can match one expression OR another expression by using the OR operator, which is the | character. For example, (this|that) would match “this” or “that”. To match all of the weekdays explicitly, we could do something like this:

(Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day

Yes! Finally it is working! Now we can just add case insensitivity and we’re all done! Wait... what is this? With (?i) switched on, the “monday” inside a word like “Summonday” is matched too.

We’re getting warmer, but we’re still not there.
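The alternation and its case-insensitive side effect can be sketched in Python as follows. The group is written as non-capturing, `(?:...)`, only because `re.findall` would otherwise return the group contents instead of the full match. The sample text is an assumption.

```python
import re

WEEKDAY = r"(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day"
text = "Monday Holiday Summonday"

# Case-sensitive: only the properly capitalised weekday is found.
case_sensitive = re.findall(WEEKDAY, text)

# Case-insensitive: the "monday" hidden inside "Summonday" matches too.
case_insensitive = re.findall(WEEKDAY, text, flags=re.IGNORECASE)

print(case_sensitive, case_insensitive)
```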

The NOT operator

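To fix this, we want to make sure that the character directly before the weekday name is not a letter or a number. We can do this by using the NOT operator, which is the ^ character placed at the start of a character class (outside of square brackets, ^ still means the start of a line). For example:

[^a-zA-Z0-9](?i)(Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day

Now “Summonday” is not selected, but we have another two issues to solve: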

  • The whitespace character before the weekday name is selected (space or new line)
  • Monday is not selected, because it is at the start of the file

So how can we check if the preceding character is not an alphabet character without actually selecting it?
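Both issues show up in a short Python sketch of the negated character class. The sample text is an assumption and case-insensitivity is passed as a flag.

```python
import re

# One character that is NOT a letter or digit, then a weekday name.
PATTERN = r"[^a-zA-Z0-9](?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day"
text = "Monday Summonday Tuesday"

negated = re.findall(PATTERN, text, flags=re.IGNORECASE)
print(negated)
```

“Monday” is missed because no character comes before it, and the match for “Tuesday” drags the leading space along.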

Lookaheads and Lookbehinds

To achieve this, we need to use what is known as a “negative lookbehind”. It is called a “lookbehind” because it checks the expression immediately before the match, but without actually including it as part of the match. Before we dive into negative lookbehinds, let’s give an example of a positive lookbehind.

The syntax is: 

(?<=<thing you want to check for>)<rest of expression>

For example, with the regex (?<=a)hacker, “hacker” is only matched if it is preceded by an “a”, and the “a” itself is not part of the match. A negative lookbehind does exactly the opposite, and the syntax is:

(?<!<thing you want to check for>)<rest of expression>

For example, (?<!a)hacker is a “negative lookbehind”, so the word “hacker” will match as long as it is NOT preceded by an “a”.
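To see the two kinds of lookbehind side by side, here is a minimal Python sketch. The sample text is made up, and the start offsets are collected to show which occurrence of “hacker” each pattern accepts.

```python
import re

text = "ahacker bhacker hacker"

# Positive lookbehind: "hacker" only when the previous character is "a".
positive_starts = [m.start() for m in re.finditer(r"(?<=a)hacker", text)]

# Negative lookbehind: "hacker" only when the previous character is NOT "a".
negative_starts = [m.start() for m in re.finditer(r"(?<!a)hacker", text)]

# In both cases the checked character is left out of the match itself.
matched_text = re.search(r"(?<=a)hacker", text).group(0)

print(positive_starts, negative_starts, matched_text)
```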

So! To solve the issues with our original problem, we can use the following:

(?<![a-zA-Z0-9_])(?i)(Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day

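This will match any weekday that is not preceded by a-z, A-Z, 0-9 or an underscore (_).

The outcome: “Monday” at the start of the file is matched, the whitespace before each weekday name is no longer selected, and “Summonday” is left alone.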
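The final expression can be tried in Python with one adjustment: the inline (?i) is replaced by the `re.IGNORECASE` flag, since recent Python versions reject inline flags in the middle of a pattern. The sample text is an assumption.

```python
import re

WEEKDAY_RE = re.compile(
    r"(?<![a-zA-Z0-9_])(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day",
    re.IGNORECASE,
)

text = "Monday Summonday Tuesday, saturday"

# The lookbehind checks the previous character without consuming it.
weekdays = WEEKDAY_RE.findall(text)
print(weekdays)
```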

Bypassing Regex-Based Security Controls

At a glance, writing custom regex may appear to be a great solution for implementing various types of input validation. Above, while learning some basic regex, we have demonstrated that there are many edge cases that need to be accounted for. As such, it is generally recommended to avoid custom regex for input validation and to use allowlists instead where possible.

To demonstrate why this is generally a bad idea, let’s say that we need to validate that a URL belongs to either bugcrowd.com, or a subdomain of bugcrowd.com. If the regex matches the URL, we will allow it, otherwise we will deny it. What regex should we use?

Many developers make the mistake of using something like this:

(https://|http://).*bugcrowd.com.*

Of course, this can be easily bypassed by prepending or appending text that results in a completely different domain, such as “https://notbugcrowd.com” or “https://bugcrowd.com.hakluke.com”.
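The bypass is easy to reproduce. The Python sketch below runs the naive pattern against both bypass URLs, then compares it with one possible stricter check that parses the URL and tests the hostname against an allowlist. The stricter check is my own illustration, not something from the original article, and the function names are hypothetical. URL parsers can disagree on odd inputs, so treat it as a starting point rather than a complete defence.

```python
import re
from urllib.parse import urlsplit

NAIVE_RE = re.compile(r"(https://|http://).*bugcrowd.com.*")

def naive_is_allowed(url):
    """The flawed check: the whole URL must fit the naive pattern."""
    return NAIVE_RE.fullmatch(url) is not None

def stricter_is_allowed(url):
    """Assumed alternative: parse first, then compare the hostname."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    is_web = parts.scheme in ("http", "https")
    return is_web and (host == "bugcrowd.com" or host.endswith(".bugcrowd.com"))

bypasses = ["https://notbugcrowd.com", "https://bugcrowd.com.hakluke.com"]

naive_results = [naive_is_allowed(url) for url in bypasses]
stricter_results = [stricter_is_allowed(url) for url in bypasses]
legit = stricter_is_allowed("https://www.bugcrowd.com/?param=value")

print(naive_results, stricter_results, legit)
```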

If you do ever run into a situation where your input is being checked or filtered in some way, through trial and error, see if you can figure out how it is being filtered. There’s a chance that regex has been used, and that means there’s a chance that it can be bypassed. Get creative!

Practical Regex Example

As a hacker, I most often use regex for filtering large chunks of text to parse out useful information. For example, we can curl https://www.bugcrowd.com and then use some regex magic to extract all of the URLs out of the response.

The regular expression I used here is:

(http:\/\/|https:\/\/).*?(?="|'| )

The tool that I used is “pcregrep”. It is basically the same as “grep” except it uses the PCRE regular expression library, which has more features. If you’re using GNU grep, you should be able to use grep -P instead.
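The same extraction can be sketched in Python. To keep the example self-contained it runs on a small made-up HTML snippet rather than a live response, and the scheme group is written as non-capturing so that `re.findall` returns whole URLs.

```python
import re

# Hypothetical stand-in for the body of an HTTP response.
html = (
    '<a href="https://www.bugcrowd.com/about">About</a> '
    "<img src='http://example.com/logo.png'> "
    "see https://hakluke.com for more"
)

# Scheme, then as few characters as possible, stopping just before
# a double quote, a single quote or a space (checked by the lookahead).
URL_RE = re.compile(r"""(?:http://|https://).*?(?="|'| )""")

found_urls = URL_RE.findall(html)
print(found_urls)
```

The lookahead works like the lookbehinds described earlier, except that it checks what comes after the match instead of what comes before it.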

Stay in Touch

At Bugcrowd, we post these kinds of how-to articles fairly frequently! If you’d like to learn more, you can join our Discord, follow us on Twitter, or check out our video content on YouTube.

If you’d like to see more from the author personally, follow hakluke on Twitter, YouTube, Instagram or check out his website.

Luke Stephens

Training and Quality Assurance Manager, Security Operations at Bugcrowd