Regular expressions (a.k.a. regex or regexp) have a fairly steep learning curve, but once you dedicate an hour or so to learning the basics, you will be far more efficient at everyday tasks. By the time you finish reading this blog, hopefully you will have a practical understanding of:
Let’s go!
A regex is a string of characters that defines a search pattern. The most basic example would be a straight string match, for example “abc”.
Of course, if regex was only capable of doing straight string matches, it wouldn’t be very useful! Here’s another example:
You may have suspected this already, but the “.” (dot) character is a wildcard: it matches any single character, so the search string will still match whatever character appears in that position! Now regex is looking slightly more useful, but we haven’t even scratched the surface yet!
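As a quick sanity check, here is a minimal Python sketch of the wildcard. The pattern "a.c" and the candidate strings are my own assumptions, picked purely for illustration.

```python
import re

# "a.c" is an assumed example pattern; the dot matches exactly one
# character (anything except a newline, by default).
pattern = re.compile(r"a.c")

candidates = ["abc", "a-c", "a c", "ac", "abbc"]

# fullmatch() requires the whole string to fit the pattern.
dot_matches = [s for s in candidates if pattern.fullmatch(s)]
print(dot_matches)  # ['abc', 'a-c', 'a c']
```

Note that "ac" and "abbc" are rejected: the dot stands for exactly one character, no more and no less.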
Oftentimes when hacking, you will find yourself in a situation where you need to parse or edit some text in an automated fashion.
As an example, let’s use a list of URLs as our input.
https://example.com/test
https://www.bugcrowd.com/?param=value
http://hakluke.com
ftp://EXAMPLE.org?test=success
We are using a tool that requires domain names, not URLs. We need to somehow extract the domain names from these URLs, but how? As you can see, the URLs in this list are quite varied: there are three different schemes (http, https and ftp), some have directories and some do not, some have parameters and others do not.
We could open it up in a text editor and manually remove them, but what if there are 1 million URLs? Do we waste a day doing this? Train monkeys? No, we don’t train monkeys. We are civilized humans. Civilized humans use regex!
There are many different ways to do this, but here’s the regex string I came up with:
(?<=://)(?i)[a-z,.]*
See below for the regex in action:
If you’re new to regex, the example above will probably provide more questions than answers – but hopefully it will give you an idea of the power of regular expressions. Read on to understand what it all means!
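If you would rather try this from a script than a regex tester, here is a rough Python sketch of the same idea. One assumption to flag: the inline (?i) is swapped for the re.IGNORECASE flag, because recent Python versions reject an inline global flag that is not at the very start of the pattern.

```python
import re

urls = """https://example.com/test
https://www.bugcrowd.com/?param=value
http://hakluke.com
ftp://EXAMPLE.org?test=success"""

# (?<=://) is a lookbehind: the match must begin right after "://".
# [a-z,.]* then consumes letters, commas and dots, so it stops at the
# first "/" or "?". re.IGNORECASE stands in for the inline (?i).
domain_pattern = re.compile(r"(?<=://)[a-z,.]*", re.IGNORECASE)

domains = domain_pattern.findall(urls)
print(domains)
# ['example.com', 'www.bugcrowd.com', 'hakluke.com', 'EXAMPLE.org']
```

Keep in mind that this character class does not include digits or hyphens, so a host name containing either would be cut short.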
Before we dive into regular expressions further, it’s important to mention a few of the basic tools that utilise regular expressions, and their common use cases.
sed
grep
pcregrep
grep -P
Even just knowing the expressions above, we can construct some very useful queries. For example, let’s say we are parsing a file with the following contents:
Monday
Green
Tuesday
Weather
Wednesday
Thursday
Friday
Saturday
Sunday
Pinecone
If we wanted to extract the words ending with “day”, we could use the following regex:
.*day$
Let’s go through what this actually does:
When we put it all together we get “Zero or more characters followed by ‘day’ followed by the end of a line”. Below is an image of the results in regex101.com. The blue highlights indicate a matched string.
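The same search can be sketched in Python. This assumes the file holds one word per line, and uses re.MULTILINE so that $ matches at the end of every line rather than only at the end of the whole text.

```python
import re

# Assumed layout: one word per line.
text = "\n".join([
    "Monday", "Green", "Tuesday", "Weather", "Wednesday",
    "Thursday", "Friday", "Saturday", "Sunday", "Pinecone",
])

# .*  -> zero or more characters
# day -> the literal string "day"
# $   -> end of a line (because of re.MULTILINE)
day_lines = re.findall(r".*day$", text, flags=re.MULTILINE)
print(day_lines)
# ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
```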
Looking good! But if we are ingesting more dynamic data, there’s a good chance this will break. One of the issues with this method is that it matches any line ending with “day”, not just single words. For example:
Of course, this is not ideal, because we only want to extract the weekday names.
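To see the problem outside of a regex tester, here is a small Python sketch. The sentence "See you on Friday" is a made-up input for illustration.

```python
import re

# The second line is a hypothetical sentence that merely ends in a weekday.
sample = "Friday\nSee you on Friday\nGreen"

# .* happily swallows everything before "day", spaces included,
# so the whole line is returned rather than just the weekday.
whole_lines = re.findall(r".*day$", sample, flags=re.MULTILINE)
print(whole_lines)  # ['Friday', 'See you on Friday']
```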
If we wanted to be more specific and not include the rest of the line, we could try something like this:
[a-z]*day$
Except something annoying will happen:
There are two problems here:
We can solve the first issue by switching off case sensitivity by preceding the expression with (?i). For example:
(?i)[a-z]*day$
We could also have explicitly included uppercase letters in the search, which would have the same result:
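The effect of case sensitivity can be sketched in Python as follows. The two-line input is an assumption for illustration; in Python, an inline (?i) is fine as long as it sits at the very start of the pattern.

```python
import re

sample_days = "Monday\nTuesday"

# Lowercase-only class: the capital first letter cannot be part of the match.
case_sensitive = re.findall(r"[a-z]*day$", sample_days, flags=re.MULTILINE)

# Option 1: switch off case sensitivity with a leading (?i).
with_flag = re.findall(r"(?i)[a-z]*day$", sample_days, flags=re.MULTILINE)

# Option 2: list the uppercase letters explicitly.
explicit_upper = re.findall(r"[a-zA-Z]*day$", sample_days, flags=re.MULTILINE)

print(case_sensitive)  # ['onday', 'uesday']
print(with_flag)       # ['Monday', 'Tuesday']
print(explicit_upper)  # ['Monday', 'Tuesday']
```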
We still have the second problem though, where “day” is chosen over “Saturday”. You might think that removing the $ at the end of our regex would fix this problem, but it doesn’t quite work because now both “Saturday” and “day” are selected:
One way to solve this problem would be to use a different quantifier. Currently the quantifier we are using is *. Here is a list of quantifiers:
So we could solve this issue by using the “+” quantifier instead of the “*” quantifier. This would mean that “day” would not be matched, because it does not have an alphabetical character preceding it. Remember that + is the same as * except that + requires one or more instances of the preceding expression, while * requires zero or more. Let’s try out the +:
Great! The word “day” is no longer matched by itself. Now we have another issue though. Words like “Happyday” or “Holiday” would still be matched.
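Here is a small Python sketch comparing the two quantifiers side by side. The three input lines are assumptions chosen to show both the fix and the remaining problem.

```python
import re

words = "Monday\nday\nHoliday"

# * allows zero letters before "day", so a bare "day" is matched.
star = re.findall(r"(?i)[a-z]*day$", words, flags=re.MULTILINE)

# + demands at least one letter before "day", so a bare "day" is skipped.
plus = re.findall(r"(?i)[a-z]+day$", words, flags=re.MULTILINE)

print(star)  # ['Monday', 'day', 'Holiday']
print(plus)  # ['Monday', 'Holiday']
```

As the output shows, "Holiday" still slips through, which is the issue tackled next.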
We can match one expression OR another expression by using the OR operator, for example (this|that) would match “this” or “that”. To match all of the weekdays explicitly, we could do something like this:
(Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day
Yes! Finally it is working! Now we can just add case insensitivity and we’re all done! Wait.. what is this?:
We’re getting warmer, but we’re still not there.
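A rough Python sketch of what is going on, using a made-up input string. One Python-specific detail: findall() would return only the contents of the parenthesised group, so this sketch uses finditer() and group(0) to get the full match.

```python
import re

weekday = r"(Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day"
sample_words = "Monday Holiday Happyday Summonday"

# Case-sensitive: only the properly capitalised weekday is found.
strict = [m.group(0) for m in re.finditer(weekday, sample_words)]

# Case-insensitive: the "monday" hiding inside "Summonday" is found too.
loose = [m.group(0) for m in re.finditer(weekday, sample_words, flags=re.IGNORECASE)]

print(strict)  # ['Monday']
print(loose)   # ['Monday', 'monday']
```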
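To fix this, we want to make sure that the character directly before the weekday name is not a letter or a number. We can do this with a negated character class: inside square brackets, a leading ^ acts as a NOT operator, so [^a-zA-Z0-9] matches any single character that is not a letter or a number. For example: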
Now “Summonday” is not selected, but we have another two issues to solve:
So how can we check if the preceding character is not an alphabet character without actually selecting it?
To achieve this, we need to use what is known as a “negative lookbehind”. It is called a “lookbehind” because it checks the expression immediately before the match, but without actually including it as part of the match. Before we dive into negative lookbehinds, let’s give an example of a positive lookbehind.
The syntax is:
(?<=<thing you want to check for>)<rest of expression>
For example:
As you can see, “hacker” is only matched if it is preceded by an “a”. A negative lookbehind does exactly the opposite, and the syntax is:
(?<!<thing you want to check for>)<rest of expression>
The example above is a “negative lookbehind”, so the word hacker will match as long as it is NOT preceded by an “a”.
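Both kinds of lookbehind can be tried out with a short Python sketch. The input string is an assumption for illustration; the start offsets show that the checked character is never part of the match itself.

```python
import re

sample_text = "ahacker hacker bhacker"

# Positive lookbehind: "hacker" only when the previous character is "a".
positive = [m.start() for m in re.finditer(r"(?<=a)hacker", sample_text)]

# Negative lookbehind: "hacker" only when the previous character is NOT "a".
negative = [m.start() for m in re.finditer(r"(?<!a)hacker", sample_text)]

print(positive)  # [1]
print(negative)  # [8, 16]

# The matched text is just "hacker"; the "a" is checked but not consumed.
print(re.search(r"(?<=a)hacker", sample_text).group(0))  # hacker
```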
So! To solve the issues with our original problem, we can use the following:
(?<![a-zA-Z0-9_])(?i)(Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day
This will match any weekday that is not preceded by a-z, A-Z, 0-9 or an underscore (_).
And here is the outcome:
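For reference, here is a minimal Python sketch of the final expression against a made-up test string. As an adaptation for Python, the inline (?i) in the middle of the pattern is replaced by the re.IGNORECASE flag, since recent Python versions only accept that inline flag at the very start of a pattern.

```python
import re

final_pattern = re.compile(
    r"(?<![a-zA-Z0-9_])(Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day",
    re.IGNORECASE,
)

# Hypothetical input covering the edge cases discussed so far.
tricky = "Monday, Summonday, _friday, 9Sunday, see you on tuesday!"

# group(0) is the full match; the lookbehind character is never included.
weekdays = [m.group(0) for m in final_pattern.finditer(tricky)]
print(weekdays)  # ['Monday', 'tuesday']
```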
At a glance, writing custom regex may appear to be a great solution for implementing various types of input validation. However, as we saw above while learning some basic regex, there are many edge cases to account for. As such, it is generally recommended to avoid regex for input validation in favour of allowlists where possible.
To demonstrate why this is generally a bad idea, let’s say that we need to validate that a URL belongs to either bugcrowd.com, or a subdomain of bugcrowd.com. If the regex matches the URL, we will allow it, otherwise we will deny it. What regex should we use?
Many developers make the mistake of using something like this:
(https://|http://).*bugcrowd.com.*
Of course, this can be easily bypassed by prepending or appending text that results in a completely different domain, such as “https://notbugcrowd.com” or “https://bugcrowd.com.hakluke.com”.
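The bypass is easy to reproduce in Python. The sketch below also includes one possible allowlist-style alternative; the is_allowed() helper is hypothetical and written only for this illustration, not a complete or recommended validator.

```python
import re
from urllib.parse import urlsplit

naive_pattern = re.compile(r"(https://|http://).*bugcrowd.com.*")

test_urls = [
    "https://www.bugcrowd.com/?param=value",
    "https://notbugcrowd.com",
    "https://bugcrowd.com.hakluke.com",
]

# The naive regex accepts all three, including the two lookalike domains.
naive_results = [bool(naive_pattern.match(u)) for u in test_urls]
print(naive_results)  # [True, True, True]


def is_allowed(url: str) -> bool:
    """Hypothetical helper: parse the URL, then compare the host exactly."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    return parts.scheme in ("http", "https") and (
        host == "bugcrowd.com" or host.endswith(".bugcrowd.com")
    )


strict_results = [is_allowed(u) for u in test_urls]
print(strict_results)  # [True, False, False]
```

The design choice here is to let a URL parser pull out the host name first, so the comparison runs against the host alone rather than the whole string.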
If you do ever run into a situation where your input is being checked or filtered in some way, through trial and error, see if you can figure out how it is being filtered. There’s a chance that regex has been used, and that means there’s a chance that it can be bypassed. Get creative!
As a hacker, I most often use regex for filtering large chunks of text to parse out useful information. For example, below we curl https://www.bugcrowd.com and then use some regex magic to extract all of the URLs out of the response:
The regular expression I used here is:
(http://|https://).*?(?="|'| )
The tool that I used is “pcregrep”. It is basically the same as “grep” except it uses the PCRE regular expression library, which has more features. If you’re using GNU grep, you should be able to use grep -P instead.
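If pcregrep is not to hand, the same expression can be sketched in Python. To keep the example self-contained, it runs against a small made-up HTML snippet instead of a live curl response, so the URLs below are illustrative only.

```python
import re

# Hypothetical stand-in for the HTML that curl would return.
html = (
    '<a href="https://www.bugcrowd.com/about">About</a> '
    "<img src='http://example.com/logo.png'> "
    "see https://example.org/page for more"
)

# .*?        -> lazy: take as few characters as possible
# (?="|'| )  -> lookahead: stop just before a quote or a space
url_pattern = re.compile(r"""(http://|https://).*?(?="|'| )""")

# group(0) is the full match; findall() would return only the scheme group.
found_urls = [m.group(0) for m in url_pattern.finditer(html)]
print(found_urls)
# ['https://www.bugcrowd.com/about', 'http://example.com/logo.png', 'https://example.org/page']
```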
At Bugcrowd, we post these kinds of how-to articles fairly frequently! If you’d like to learn more, you can join our Discord, follow us on Twitter, or check out our video content on YouTube.
If you’d like to see more from the author personally, follow hakluke on Twitter, YouTube, Instagram or check out his website.