Lawdy, even that subtitle was long. If you decided to skip straight to bug bounty and not bother with nerd things like computer science or networking like I did—well, it turns out you need to know these things.

This article will be a long one, but compared to getting a college degree, it’s nothing. Also, it’s free (you’re welcome). So, time to dust the cobwebs off the portions of your brain responsible for studying. First, we’ll cover the background knowledge you need to tackle the topics in this article.

Then, and only then—in the second part of this series—will we cover tricks to sneak your payloads past the cyberguards.


Binary

In 2024, Nvidia unveiled its Blackwell B200, a single microchip that has 208 billion transistors. Yes, billion. Still not in awe? What if I told you it is roughly the size of a Toaster Strudel®? How many years do you think are equivalent to 208 million seconds? About 6.6 years. Okay, now how many years is 208 billion seconds? Roughly 6,600 years. Neat. But why is this such a big deal?

Transistors in a computer’s microchip act as switches to create binary code. Electronic devices, such as computers, do not actually understand English, German, or Russian. What they do understand is binary, which consists of two numbers: 0 and 1. Just like a light switch, a transistor can be set to either on (1) or off (0) by controlling the flow of electricity.

So, the more light switches, the more processing power to do cool things.

The numbers you are most familiar with belong to the decimal base-10 system, which uses ten digits: 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9. In base-10, the position of a digit in a multidigit number determines its value, and each position is worth ten times the one to its right. Although you already knew all this, to visualize it, let’s use the decimal number 101:

1 0 1
100s place 10s place 1s place

Binary code is a base-2 system (since, again, binary consists of just 0s and 1s). You may be wondering how then a computer can understand numbers besides 0, 1, and combinations of 0s and 1s.

A bit, short for binary digit, is the smallest unit of data in computing. A bit can store either (you guessed it) a 0 or a 1. Bits are grouped together in sets of eight, and this group of 8 bits is called a byte.

With 8 bits, each holding one of two values, the number of possible unique combinations is 2^8, which amounts to 256. Let’s take our example decimal number of 101 from earlier and show how it is represented as a byte:

 

0 1 1 0 0 1 0 1
128 64 32 16 8 4 2 1

You may have noticed that each bit position is worth twice the one to its right. If you add up all the position values that have their bit set to 1 (64 + 32 + 4 + 1), you get 101 again.
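If you’d rather let a computer do the adding, here’s a quick sketch in Python (the variable names are mine) that sums the place values and double-checks the answer with the built-in int and format functions:

```python
# Sum the place values of every bit set to 1 in the byte 01100101
# (128, 64, 32, 16, 8, 4, 2, 1 from left to right).
bits = "01100101"
place_values = [2 ** p for p in range(len(bits) - 1, -1, -1)]  # [128, 64, ..., 1]
decimal = sum(value for bit, value in zip(bits, place_values) if bit == "1")
print(decimal)  # 101

# Python's built-ins do the same conversion in both directions.
print(int("01100101", 2))  # 101
print(format(101, "08b"))  # 01100101
```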

You may notice that if you add up all the position values, you only get a maximum number of 255, even though 2^8 is 256. Don’t forget to include the value of 0.

How are numbers greater than 255 created then? By using multiple bytes of course. Using decimal number 256 as an example, here’s how it would be represented in bytes:

 

0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
32,768 16,384 8,192 4,096 2,048 1,024 512 256 128 64 32 16 8 4 2 1

If you add all the position values for 2 bytes, you get a maximum number of 65,535. Again, don’t forget to include the value of 0.

Be aware that, in binary, leading zeros can be excluded when you want to express a number in its simplest form. For example, the decimal value of 3 in binary is 00000011. If you chose to express it without the leading zeros, it would just be 11. If you wanted to indicate that a 3-bit or 4-bit representation is being used, it would be 011 or 0011, respectively. But I will just write out all 8 bits throughout this paper because I am not LAZY.

IPv4

You may have heard of the term “IP address.” If you are unsure about what this refers to, keep reading. If you are confident you understand the concept, feel free to skip ahead.

Think of the Internet Protocol (IP) as the postal system of the Internet. Each device that is connected to the internet receives a unique, public IP address that is used to identify it. These addresses achieve the same purpose as yours and your friend’s mailing addresses for sending each other letters or packages. The difference here is that between devices, data is sent instead, and this data is in packets.

There are multiple versions of the IP, including IPv4 and IPv6. IPv4 is the older version of the protocol, where addresses consist of 4 bytes. With 4 bytes, you get 2^32 = 4,294,967,296 possible addresses.

IPv4 addresses are represented as decimal numbers in four sections, each of which is known as an octet. For example, what is known as the localhost (the IP address that refers to the device you are currently using) has an IPv4 address of 127.0.0.1. This address in binary is:

01111111.00000000.00000000.00000001

 

Byte 1 Byte 2 Byte 3 Byte 4
01111111 00000000 00000000 00000001
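If you want to play with this yourself, here’s a small Python sketch (the helper name ipv4_to_binary is my own invention) that turns any dotted-decimal IPv4 address into its four binary octets:

```python
def ipv4_to_binary(address):
    """Turn a dotted-decimal IPv4 address into dotted 8-bit binary octets."""
    return ".".join(format(int(octet), "08b") for octet in address.split("."))

print(ipv4_to_binary("127.0.0.1"))
# 01111111.00000000.00000000.00000001
```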

A computer network is a group of interconnected computers and devices that communicate with each other to share resources, such as data and files. With an internet connection, all the networks across the world create a global network. 

But wait, there are over eight billion people on Earth, with many of them owning multiple devices that can access the internet. Are 4,294,967,296 IP addresses enough? Actually, no.

IPv4 was established decades ago, before anyone could predict that the internet would be as ubiquitous as it is today. In 2019, the last freely available block of addresses was handed out. This means that now, IPv4 addresses are essentially recycled: once released by their most recent holder, they are allocated to members on a waiting list.

So how then does your new smartphone or Internet of Things (IoT) device, like a home security camera that also dispenses dog treats, send and receive data packets online? Are you internet royalty who gets prioritized on the IP address waiting list? I’m sorry to break it to you, but no.

NAT

The introduction of the Network Address Translation (NAT) protocol saved the day. It had become widespread by 2004 because, as the number of internet users kept growing, so did the concern that we would eventually run out of IPv4 addresses. NAT allows multiple devices on the same network to share a single public IP address, specifically the address of the device connected to the internet.

For your home network, a public IP address is assigned to your router, and everything behind your router is considered to be your “local” network. A router is also referred to as a default gateway, as it is the gateway to the wonderful World Wide Web.

The devices behind your router receive private IP addresses. The device you are reading this article on can actually have the same private IP address as many other devices across the world. But this doesn’t matter, because those devices are behind their own routers as well.

The different local network private IP address ranges that are used include:

  • 10.0.0.0 to 10.255.255.255—This range is often used in large organizations that require a significant number of addresses due to the number of internal computers and devices they have.
  • 172.16.0.0 to 172.31.255.255—This range is also used in medium or large-sized networks but offers fewer addresses than the previously mentioned range.
  • 192.168.0.0 to 192.168.255.255—This range is commonly used in home networks.

Subnetting

Local networks can be considered a form of subnetting. Subnetting involves dividing a network into smaller subnetworks, referred to as subnets.

Let’s use a home network as an example since this is probably what you are most familiar with. While each octet holds the number range of 0 to 255, there are three private IP addresses a subnet reserves:

  1. 192.168.0.0—The very first address is known as the network address and identifies the subnet itself.
  2. 192.168.0.1—By default, this is the address of your router, though it can be configured to have a different last octet.
  3. 192.168.0.255—This is the broadcast address. When a device sends a data packet to this address, all devices on the same subnet receive the packet.

Besides these three addresses, all the others are available to be allocated to computers and devices on the same subnet. This process is automatically handled by the Dynamic Host Configuration Protocol (DHCP).

Again, the private IP address range of a home network is 192.168.0.0 to 192.168.255.255. This means you can have 256 subnets with 256 addresses each (minus the three previously mentioned). But what if you want more addresses on the same subnet? If you run the terminal command ifconfig in Linux/MacOS or ipconfig in Windows, you will see an address associated with your netmask or subnet mask depending on your operating system. I bet the mask is 255.255.255.0, which in binary is:

11111111.11111111.11111111.00000000

The octets of all 1s mark the network and subnet portion of the address; only the bit positions set to 0 are used to identify a device. So, if you need more device addresses on the same subnet for whatever reason, you can change the mask to essentially unlock more. For example:

11111111.11111111.11111110.00000000

In the above mask, another bit position is unlocked. So now, instead of 8 for a total of 256 addresses, it is 9 for a total of 512 addresses allocated to the subnet. You may have seen IP address ranges represented as 192.168.0.0/24. This is the Classless Inter-Domain Routing (CIDR) representation and indicates how many bits starting from the left-hand side of an address are part of a network/subnet address. So, with a mask of 255.255.255.0, the network/subnet address is 192.168.0, with the device portion of the address being .0 to .255 (again, minus the network, broadcast, and router addresses).
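Python’s standard ipaddress module will happily do this subnet math for you. A quick sketch showing the /24 versus /23 difference described above:

```python
import ipaddress

# A /24 mask (255.255.255.0) leaves 8 host bits: 2**8 = 256 addresses.
net24 = ipaddress.ip_network("192.168.0.0/24")
print(net24.netmask, net24.num_addresses)  # 255.255.255.0 256

# Unlocking one more bit (/23, mask 255.255.254.0) gives 2**9 = 512 addresses.
net23 = ipaddress.ip_network("192.168.0.0/23")
print(net23.netmask, net23.num_addresses)  # 255.255.254.0 512
```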

ASCII

Okay, now you understand how computers interpret decimal numbers using binary bits and bytes. But how do they interpret letters? Well, back in ye olden days, different computer manufacturers represented characters in their own way. This meant that different makes and models of computers were unable to communicate with each other.

This is where encoding comes in, as it is the process of converting one type of data into another, allowing for a standardized representation of characters.

Designed in the 1960s, the American Standard Code for Information Interchange (ASCII) is an encoding standard that assigns a unique number to 128 different characters. This set includes both printable and unprintable characters.

Printable characters are numbers, letters, and symbols.

The ones that cannot be printed are control characters, such as carriage return (CR) and line feed (LF), which are used to mark the end and beginning of a line of text. New page (aka form feed) was used for printing. Bell (BEL) made your computer beep. You get the idea.

Be aware that due to historical reasons, the order of character groupings skips ahead and is ugly.

ASCII Control Characters
Decimal Binary Character Name
0 00000000 NUL Null
1 00000001 SOH Start of heading
2 00000010 STX Start of text
3 00000011 ETX End of text
4 00000100 EOT End of transmission
5 00000101 ENQ Enquiry
6 00000110 ACK Acknowledge
7 00000111 BEL Bell
8 00001000 BS Backspace
9 00001001 HT Horizontal tab
10 00001010 LF Line feed
11 00001011 VT Vertical tab
12 00001100 FF New page
13 00001101 CR Carriage return
14 00001110 SO Shift out
15 00001111 SI Shift in
16 00010000 DLE Data link escape
17 00010001 DC1 Device control 1
18 00010010 DC2 Device control 2
19 00010011 DC3 Device control 3
20 00010100 DC4 Device control 4
21 00010101 NAK Negative acknowledgement
22 00010110 SYN Synchronous idle
23 00010111 ETB End of transmission block
24 00011000 CAN Cancel
25 00011001 EM End of medium
26 00011010 SUB Substitute
27 00011011 ESC Escape
28 00011100 FS File separator
29 00011101 GS Group separator
30 00011110 RS Record separator
31 00011111 US Unit separator
127 01111111 DEL Delete

The decimal and binary values of symbol characters are:

ASCII Symbol Characters
Decimal Binary Character
32 00100000 (Space)
33 00100001 !
34 00100010 "
35 00100011 #
36 00100100 $
37 00100101 %
38 00100110 &
39 00100111 '
40 00101000 (
41 00101001 )
42 00101010 *
43 00101011 +
44 00101100 ,
45 00101101 -
46 00101110 .
47 00101111 /
58 00111010 :
59 00111011 ;
60 00111100 <
61 00111101 =
62 00111110 >
63 00111111 ?
64 01000000 @
91 01011011 [
92 01011100 \
93 01011101 ]
94 01011110 ^
95 01011111 _
96 01100000 `
123 01111011 {
124 01111100 |
125 01111101 }
126 01111110 ~

To save myself from having to make another table for numbers and letters, note the following rules (you can convert them to binary if you want to practice—no I am not just being lazy…okay I am):

  • Digits 0 to 9 have a decimal range of 48 to 57.
  • Uppercase letters A to Z have a decimal range of 65 to 90.
  • Lowercase letters a to z have a decimal range of 97 to 122.
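If you want to verify these ranges without a table, Python’s built-in ord() and chr() functions map characters to their code numbers and back:

```python
# ord() gives a character's ASCII/Unicode number; chr() goes the other way.
print(ord("0"), ord("9"))  # 48 57
print(ord("A"), ord("Z"))  # 65 90
print(ord("a"), ord("z"))  # 97 122
print(chr(48), chr(65), chr(97))  # 0 A a
```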

You may have noticed that only 7 bits are used. This caused a wave of disagreement about what characters should be assigned to the numbers 128 to 255. All encoding ambassadors around the world eventually agreed to not touch the ASCII table. But this 8th bit was used for different purposes since different languages use different characters. All these different character assignments for the free 128 characters made available by that last bit are known as code pages. If your computer interprets data using one code page while the data was encoded using another—well, we now have the interoperability issue again, don’t we?

Unicode

Officially released in 1991, Unicode sought to solve this character mismatch problem once and for all. It dreamed of being a universal character encoding system that can accommodate all characters from different languages. And it can—with space left over for more.

Again, since 1 byte only allows for 256 different characters, Unicode uses multiple bytes to solve this issue. The most commonly used Unicode encoding formats are UTF-8, UTF-16, and UTF-32, where the number in each name tells you the size, in bits, of the units the encoding works with.

Unicode uses code points to identify characters. In formal or documentation contexts, these begin with U+ to let everyone know that Unicode is being used. For example, the code point for the letter A is:

U+0041

In programming, the \u escape prefix is used to represent Unicode characters by their code point. For example, in JavaScript, \u0041 is interpreted as the letter A. So, when JavaScript comes across \u0041 in a string, it knows to convert that escape sequence (which is the term used for these special codes that represent a character) into an A.
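Python strings support the same \u escape, so you can poke at it interactively:

```python
# "\u0041" is the escape sequence for code point U+0041.
print("\u0041")         # A
print("\u0041" == "A")  # True
print(hex(ord("A")))    # 0x41, the code point in hexadecimal
```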

But how are those last four characters determined?

Back to the bases: Hexadecimal

The prefix “hexa” means six. Combine that with “decimal,” and you get sixteen. The hexadecimal base-16 system uses sixteen characters: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, and F. Each character represents a decimal value, with 0 to 9 representing themselves and A to F representing 10 to 15, respectively. Each hexadecimal character can also be represented as just 4 bits, since the values 0 to 15 never need bit positions 5 through 8.

Hexadecimal Character Decimal Value Binary Byte 4-bit Binary Representation
0 0 00000000 0000
1 1 00000001 0001
2 2 00000010 0010
3 3 00000011 0011
4 4 00000100 0100
5 5 00000101 0101
6 6 00000110 0110
7 7 00000111 0111
8 8 00001000 1000
9 9 00001001 1001
A 10 00001010 1010
B 11 00001011 1011
C 12 00001100 1100
D 13 00001101 1101
E 14 00001110 1110
F 15 00001111 1111

Keep in mind that hexadecimal encoding is not case-sensitive, meaning the six letters used can be either lowercase or uppercase. However, it is common to see them capitalized in Unicode, to make it pretty or whatever.

In programming languages, the 0x prefix is used to let everyone know that the hexadecimal format for a number is being used. For example, 0x41 will be interpreted as its decimal equivalent of 65. The escape sequence prefix used for hexadecimal is \x. So, \x41 will be interpreted as the letter A.
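Both notations work the same way in Python, for example:

```python
print(0x41)    # 65: the 0x prefix marks a hexadecimal integer literal
print("\x41")  # A: the \x escape marks a hexadecimal character code
print(hex(65)) # 0x41: converting a decimal number back to hexadecimal
```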

Hexadecimal is commonly used to represent a byte’s binary value in a compact way. For example:

Decimal Hexadecimal Binary byte Character Name
0 00 00000000 Null
9 09 00001001 Tab
10 0A 00001010 Line feed
13 0D 00001101 Carriage return
32 20 00100000 Space
34 22 00100010 Double quote
38 26 00100110 & Ampersand
39 27 00100111 Single quote
60 3C 00111100 < Less than
61 3D 00111101 = Equals
62 3E 00111110 > Greater than
65 41 01000001 A

If you’ve been bug hunting, you may recognize a lil’ sumthin’ sumthin’ by now. But be patient.

To convert a decimal number into its hexadecimal equivalent, you have to see how many times you can fit the number 16 in it and use the remainder. We will use the letter A as an example:

  1. The decimal value tied to the letter A is 65. Divide this number by 16.
  2. 65 / 16 = 4 with a remainder of 1.
  3. Now take the quotient of 4 and divide by 16 again.
  4. 4 / 16 = 0 with a remainder of 4.
  5. In hexadecimal, you take the remainder values and write them from the last to the first, going left to right, which results in 41.
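Those steps translate directly into a short loop. Here’s a sketch in Python (the function name to_hex is mine, not a built-in):

```python
HEX_DIGITS = "0123456789ABCDEF"

def to_hex(n):
    """Convert a non-negative integer to hexadecimal by repeatedly
    dividing by 16 and reading the remainders from last to first."""
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        n, remainder = divmod(n, 16)
        digits.append(HEX_DIGITS[remainder])
    return "".join(reversed(digits))

print(to_hex(65))   # 41
print(to_hex(255))  # FF
```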

For characters that only require a single byte, to determine the last four characters after the U+ in Unicode, you use this decimal-to-hexadecimal conversion and pad with leading zeros if necessary.

Back to the bases: Octal

The octal base-8 system uses the numerical digits 0 to 7.

The prefix of just 0 is used to let everyone know that a number is in octal format. For example, 075 represents the decimal number 61. The escape prefix is \0 or just \. For example, \061 would be interpreted as 1 and \101 would be interpreted as the letter A.

To convert a decimal number to its octal equivalent, you have to see how many times you can fit the number 8 in it and again use the remainder. Using the letter A as an example again:

  1. The decimal value tied to the letter A is 65. Divide this number by 8.
  2. 65 / 8 = 8 with a remainder of 1.
  3. Now take the quotient of 8 and divide by 8.
  4. 8 / 8 = 1 with a remainder of 0.
  5. Now take the quotient of 1 and divide by 8.
  6. 1 / 8 = 0 with a remainder of 1.
  7. As is the same with decimal to hexadecimal, you take the remainder values and write them from the last to the first, going left to right, which results in 101.
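The same divide-and-collect-the-remainders loop works for octal, just dividing by 8 instead of 16. A Python sketch (to_octal is my own helper name):

```python
def to_octal(n):
    """Convert a non-negative integer to octal by repeatedly
    dividing by 8 and reading the remainders from last to first."""
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        n, remainder = divmod(n, 8)
        digits.append(str(remainder))
    return "".join(reversed(digits))

print(to_octal(65))  # 101
print(oct(65))       # 0o101 (Python spells the octal integer prefix as 0o)
print(chr(0o101))    # A, the same round trip the \101 escape performs
```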

UTF-8

With UTF-16, there was yet another argument over its two byte-order variants, but we won’t get into that. For now, just know that UTF-8 solved it, and that’s why it’s a hero and more popular.

Unicode Transformation Format 8-bit (UTF-8) is the most commonly used Unicode encoding standard. This format uses 1 byte for the first 128 characters of ASCII but can expand up to 4 bytes (32 bits) if necessary. This is known as variable-length encoding, which saves storage space, as most of the time you are going to use the ASCII characters, which only require 7 bits.

If the number of bytes required to represent a character in UTF-8 is greater than one, there are different bit rules for the bytes:

Unicode Code Point Range Number of bytes 1st byte Starts with Following bytes Start with
U+0000 to U+007F 1 0xxxxxxx (none)
U+0080 to U+07FF 2 110xxxxx 10xxxxxx
U+0800 to U+FFFF 3 1110xxxx 10xxxxxx
U+10000 to U+10FFFF 4 11110xxx 10xxxxxx

As an example, let’s encode the symbol € in UTF-8 with a Unicode code point of U+20AC:

Hexadecimal Character 4-bit Binary Representation
2 0010
0 0000
A 1010
C 1100

 

  1. The code point U+20AC in binary is: 0010 0000 1010 1100.
  2. The code point is within the 3 byte range of U+0800 to U+FFFF.
  3. The first byte starts with 1110, so with the first 4 bits of the code point, it becomes 1110 0010, creating a full byte.
  4. The second byte starts with 10, so using the next 6 bits of the code point, it becomes 1000 0010, creating another full byte.
  5. Finally, the third byte also starts with 10, so using the remaining 6 bits of the code point, it becomes 1010 1100, creating another full byte.
  6. Combining these 3 bytes, the binary representation in UTF-8 encoding is:

1110 0010 1000 0010 1010 1100

  7. The hex character with a binary value of 1110 is E. The hex character with a binary value of 0010 is 2. So the first byte in UTF-8 is E2.
  8. 1000 0010 = 82
  9. 1010 1100 = AC
  10. So, the UTF-8 encoded representation of the Euro symbol is: 0xE2 0x82 0xAC
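You can confirm the whole worked example in Python, since encoding a string to UTF-8 produces exactly these bytes:

```python
# Encode the Euro sign (code point U+20AC) to UTF-8 and inspect the bytes.
encoded = "\u20ac".encode("utf-8")
print([hex(byte) for byte in encoded])  # ['0xe2', '0x82', '0xac']

# Decoding the three bytes brings the character back.
print(bytes([0xE2, 0x82, 0xAC]).decode("utf-8"))
```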

Snack time and then attack time

Holy moly that was a lot. Take a break now, eat, and hydrate. Come back and read the rest once you are refreshed.

Welcome back!

With your foundational understanding of how IP addresses are made and how different encodings are interpreted, you are now ready to learn how they can be used to make your payloads mo’ spicy.

Continue on to learn about dotless, hexadecimal, octal, and combinations of them to create IP addresses that may bypass protection mechanisms. Additionally, you will learn how to use encoding to possibly smuggle your injection attack payloads past security defenses.

 

Go on, brave soldier, to Part II.