Hi again. This is part 2 of the footprint-shrinking post. If you missed it, part 1 is HERE.
We will pick up right where we left off.
Here’s where we are so far.
And here’s our code.
#include <Windows.h>
#include <winternl.h>

struct R_RTL_USER_PROCESS_PARAMETERS
{
    ULONG MaximumLength;
    ULONG Length;
    ULONG Flags;
    ULONG DebugFlags;
    HANDLE ConsoleHandle;
    ULONG ConsoleFlags;
    HANDLE StandardInput;
    HANDLE StandardOutput;
    HANDLE StandardError;
    BYTE CurrentDirectory[0x18];
    UNICODE_STRING DllPath;
    UNICODE_STRING ImagePathName;
    UNICODE_STRING CommandLine;
    PWSTR Environment;
    ULONG StartingX;
    ULONG StartingY;
    ULONG CountX;
    ULONG CountY;
    ULONG CountCharsX;
    ULONG CountCharsY;
    ULONG FillAttribute;
    ULONG WindowFlags;
    ULONG ShowWindowFlags;
    UNICODE_STRING WindowTitle;
    UNICODE_STRING DesktopInfo;
    UNICODE_STRING ShellInfo;
    UNICODE_STRING RuntimeData;
    BYTE CurrentDirectories[0x300];
    SIZE_T EnvironmentSize;
    SIZE_T EnvironmentVersion;
};

constexpr int StrCmp(const char *str1, const char *str2)
{
    while (*str1 && *str2)
    {
        if (*str1 != *str2)
            return *str1 - *str2;
        str1++;
        str2++;
    }
    return *str1 - *str2;
}

void* ParseEAT(void* dllbase, const char *func)
{
    PIMAGE_DOS_HEADER dos = (PIMAGE_DOS_HEADER)dllbase;
    PIMAGE_NT_HEADERS nt = (PIMAGE_NT_HEADERS)((BYTE*)dos + dos->e_lfanew);
    if (nt->Signature != IMAGE_NT_SIGNATURE)
        return nullptr;

    DWORD exportrva = nt->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_EXPORT].VirtualAddress;
    // No export table
    if (exportrva == 0)
        return nullptr;

    PIMAGE_EXPORT_DIRECTORY exports = (PIMAGE_EXPORT_DIRECTORY)((BYTE*)dos + exportrva);
    DWORD* names = (DWORD*)((BYTE*)dos + exports->AddressOfNames);
    USHORT* ordinals = (USHORT*)((BYTE*)dos + exports->AddressOfNameOrdinals);
    DWORD* functions = (DWORD*)((BYTE*)dos + exports->AddressOfFunctions);
    for (DWORD i = 0; i < exports->NumberOfNames; i++)
    {
        if (!StrCmp((const char*)((BYTE*)dos + names[i]), func))
            return (void*)((BYTE*)dos + functions[ordinals[i]]);
    }
    return nullptr;
}

void* GetPEBFunction(const char *func)
{
    PPEB peb = NtCurrentTeb()->ProcessEnvironmentBlock;
    PPEB_LDR_DATA ldr = peb->Ldr;
    for (PLIST_ENTRY entry = ldr->InMemoryOrderModuleList.Flink; entry != &ldr->InMemoryOrderModuleList; entry = entry->Flink)
    {
        PLDR_DATA_TABLE_ENTRY mod = CONTAINING_RECORD(entry, LDR_DATA_TABLE_ENTRY, InMemoryOrderLinks);
        if (mod->DllBase == nullptr)
            continue;
        void* address = ParseEAT(mod->DllBase, func);
        if (address != nullptr)
            return address;
    }
    return nullptr;
}

int mainCRTStartup(PPEB peb)
{
    struct R_RTL_USER_PROCESS_PARAMETERS *params = (struct R_RTL_USER_PROCESS_PARAMETERS *)peb->ProcessParameters;
    HANDLE stdout = params->StandardOutput;
    using tNtWriteFile = NTSTATUS(NTAPI *)(HANDLE, HANDLE, PVOID, PVOID, PVOID, PVOID, ULONG, PLARGE_INTEGER, PVOID);
    const char ntwritefile[] = {'N', 't', 'W', 'r', 'i', 't', 'e', 'F', 'i', 'l', 'e', '\0'};
    tNtWriteFile NtWriteFile = (tNtWriteFile)GetPEBFunction(ntwritefile);
    IO_STATUS_BLOCK iosb;
    const char hello[] = {'H', 'e', 'l', 'l', 'o', ' ', 'W', 'o', 'r', 'l', 'd', '!', '\n'};
    NtWriteFile(stdout, NULL, NULL, NULL, &iosb, (PVOID)hello, 13, NULL, NULL);
    return 0;
}
There are a couple more optimizations to be had here. We can switch optimization to favor size to shave off a few more bytes.
C/C++ → Optimization → Optimization: Maximum Optimization (Favor Size) (/O1)
Now, there is pretty much nothing else we can do from the compiler side to shrink this program (minus optimizing the code a bit). After this, we have to remove bytes by hand.
Remember, all that’s left is the exception directory. Fortunately, we can remove this. But, we have to do it by hand.
First, you need to understand how exception data is laid out in a PE file. We will worry about stripping the header out later.
For each function (that is capable of undergoing exception handling), there is a 12 byte entry in the .pdata section. This is the structure definition of the entry:
struct RUNTIME_FUNCTION
{
    DWORD FunctionStart; // RVA of the first byte of the function (same as the file offset in this file)
    DWORD FunctionEnd;   // RVA of the first byte after the function
    DWORD UnwindInfo;    // RVA of the unwind info
};
These are all listed sequentially at the start of the .pdata section. Since the compiler optimized everything into a single function, there is only one of them in this example. As you can see in the screenshot, the virtual size of the section is 0xC, or 12 bytes, the size of the structure. The raw size is 0x10, or 16, because the section alignment is set to 16.
So, of course, we want to remove that entire sequence of bytes. But remember that the UnwindInfo field points to the unwind info, which lies in the .rdata (now merged into .text) section, so we can strip that out too. In this example, the unwind info sits at the end of the section. I’m not sure if this is always the case, but it is super fortunate here, because it means the unwind info can simply be cut off the end of the file. Let’s look at the structure of the unwind info.
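To make the 12-byte layout concrete, here is a minimal Python sketch that splits a raw .pdata blob into entries. The function name `parse_pdata` and the sample RVAs are made up for illustration, they are not taken from the real binary.

```python
import struct

# Each x64 RUNTIME_FUNCTION is three little-endian DWORDs (12 bytes).
ENTRY = struct.Struct("<III")


def parse_pdata(pdata: bytes, virtual_size: int):
    """Return a list of (begin_rva, end_rva, unwind_rva) tuples."""
    # Only the virtual size holds entries; anything after it is raw padding.
    usable = virtual_size - (virtual_size % ENTRY.size)
    return [ENTRY.unpack_from(pdata, off) for off in range(0, usable, ENTRY.size)]


# Hypothetical section contents: one entry, padded out to a 16-byte raw size.
sample = struct.pack("<III", 0x200, 0x38D, 0x390) + b"\x00" * 4
entries = parse_pdata(sample, virtual_size=0xC)
for begin, end, unwind in entries:
    print(f"function {begin:#x}..{end:#x}, unwind info at {unwind:#x}")
```

Passing the virtual size rather than the raw size keeps the alignment padding from being misread as a partial entry.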
#pragma pack(push, 1)
struct UNWIND_CODE
{
    BYTE CodeOffset;
    BYTE UnwindOp : 4;
    BYTE OpInfo : 4;
};

struct UNWIND_INFO
{
    BYTE Version : 3;
    BYTE Flags : 5;
    BYTE SizeOfProlog;
    BYTE CountOfCodes;
    BYTE FrameRegister : 4;
    BYTE FrameOffset : 4;
    UNWIND_CODE UnwindCode[1]; // A C trick to extend the size of this structure
};
#pragma pack(pop)
To be honest, we don’t really need to fully understand how exceptions work, but we can discuss a few things. Exception exploitation might be the topic of a future post.
Like the RUNTIME_FUNCTION entries in .pdata, these entries are laid out sequentially, but they are variable in size. The size of an entry is sizeof(UNWIND_INFO) + CountOfCodes * 2 - 2, which works out to 4 + CountOfCodes * 2, because the structure definition already includes one 2-byte UNWIND_CODE.
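As a quick worked version of that size formula, here is a small Python sketch. The helper name `unwind_info_size` and the sample bytes are hypothetical. It also deliberately ignores two details of the real format: the code array is padded to an even number of slots, and handler or chained data can follow when the relevant flags are set.

```python
HEADER_SIZE = 4  # Version/Flags, SizeOfProlog, CountOfCodes, FrameRegister/FrameOffset
CODE_SIZE = 2    # one UNWIND_CODE slot


def unwind_info_size(blob: bytes, offset: int = 0) -> int:
    """Size of the fixed header plus the unwind code array (simplified)."""
    count_of_codes = blob[offset + 2]
    return HEADER_SIZE + CODE_SIZE * count_of_codes


# Hypothetical unwind info: version 1, no flags, 4-byte prolog, one unwind code.
sample = bytes([0x01, 0x04, 0x01, 0x00, 0x04, 0x42])
print(unwind_info_size(sample))  # 6
```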
We don’t really need to get the size of the structure or do any parsing. We can just cut off everything beyond the start of the RVA that the first RUNTIME_FUNCTION’s UnwindInfo structure points to.
So, since this structure starts at 0x390, everything beyond 0x390 can be removed from the program. Ergo, the program will be 912 bytes in size, removing 48 bytes. Also! See those 3 0xCC alignment bytes? Those can be removed. Now, the program is 0x38D bytes in size (909).
Let’s throw together some Python code that does this. The pefile module is a blessing here as it can do most of the parsing for us. Like I said, all we need to do is acquire the RVA of the first unwind info and then remove all the bytes after it. The ctypes library is also useful here as it helps translate bytes into Python classes.
If you’re cool, you can add this as a post build script straight from VS.
import pefile
import ctypes
import sys


class RUNTIME_FUNCTION(ctypes.Structure):
    # c_uint32 keeps each field at 4 bytes on every platform
    # (c_ulong is 8 bytes on 64-bit Linux and macOS)
    _fields_ = [
        ("BeginAddress", ctypes.c_uint32),
        ("EndAddress", ctypes.c_uint32),
        ("UnwindInfoAddress", ctypes.c_uint32)
    ]


def main():
    if len(sys.argv) != 2:
        print("Usage: python rip.py <path to executable>")
        return

    with open(sys.argv[1], "rb") as f:
        data = f.read()

    pe: pefile.PE = pefile.PE(data=data)

    # Get start of .pdata section
    # Note: this script uses RVAs as file offsets, which only works
    # because the two coincide in this particular file
    pdata_start = None
    for section in pe.sections:
        if section.Name.decode().strip("\x00") == ".pdata":
            pdata_start = section.VirtualAddress
            break
    if pdata_start is None:
        print("No .pdata section found")
        return

    runtime_func = RUNTIME_FUNCTION.from_buffer_copy(data[pdata_start:])

    # Strip everything after the start of the unwind data
    # (which should hopefully be at the end of the .text section)
    newdata = data[:runtime_func.UnwindInfoAddress]
    # Strip off the trailing 0xCC alignment bytes
    while newdata[-1] == 0xCC:
        newdata = newdata[:-1]

    print(f"Size of new .exe = {len(newdata):#X} ({len(newdata)})")
    with open(f"{sys.argv[1].removesuffix('.exe')}_new.exe", "wb") as f:
        f.write(newdata)


if __name__ == '__main__':
    main()
This works, and we have a 909-byte executable. But, the .pdata information is still in the header. This needs to be removed.
First, we need to get rid of the directory entry in the header. This is just 8 bytes holding the RVA of the exception data and its size. Zeroing it out is essentially the same as removing it.
Next, the actual section header needs to be removed. It can’t exactly be “removed”, but it can be zeroed out, which would convince any disassembler or PE explorer that the zeroed bytes are just alignment bytes.
After the section header is removed, the header should be updated to reflect that there is now only a single section. This count is in the IMAGE_FILE_HEADER.
Lastly, the size of the .text section should be updated to reflect the removed bytes. The size of the image in the optional header should be updated too.
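For a feel of what the first of those edits touches at the byte level, here is a rough Python sketch that finds and zeroes the exception directory entry by hand. It assumes a 64-bit (PE32+) image, where the data directories start 112 bytes into the optional header and the exception directory is index 3. The function names and the fake header are invented for the demo.

```python
import struct


def exception_directory_offset(image: bytes) -> int:
    """File offset of the exception data directory entry in a PE32+ image."""
    e_lfanew = struct.unpack_from("<I", image, 0x3C)[0]
    if image[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        raise ValueError("not a PE image")
    optional_header = e_lfanew + 4 + 20        # signature + IMAGE_FILE_HEADER
    magic = struct.unpack_from("<H", image, optional_header)[0]
    if magic != 0x20B:
        raise ValueError("expected a PE32+ optional header")
    data_directories = optional_header + 112   # PE32+ directories start here
    return data_directories + 3 * 8            # IMAGE_DIRECTORY_ENTRY_EXCEPTION == 3


def zero_exception_directory(image: bytes) -> bytes:
    """Return a copy of the image with the 8-byte directory entry zeroed."""
    patched = bytearray(image)
    off = exception_directory_offset(image)
    patched[off:off + 8] = b"\x00" * 8
    return bytes(patched)


# Hypothetical, mostly empty header that is just valid enough for the demo.
fake = bytearray(0x200)
struct.pack_into("<I", fake, 0x3C, 0x40)          # e_lfanew
fake[0x40:0x44] = b"PE\x00\x00"                   # PE signature
struct.pack_into("<H", fake, 0x58, 0x20B)         # PE32+ magic
struct.pack_into("<II", fake, 0xE0, 0x3C0, 0xC)   # exception directory RVA, size
offset = exception_directory_offset(bytes(fake))
patched = zero_exception_directory(bytes(fake))
print(f"{offset:#x}", patched[offset:offset + 8].hex())
```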
Fortunately, pefile lets us do all of this with relative ease.
import pefile
import ctypes
import sys


class RUNTIME_FUNCTION(ctypes.Structure):
    # c_uint32 keeps each field at 4 bytes on every platform
    _fields_ = [
        ("BeginAddress", ctypes.c_uint32),
        ("EndAddress", ctypes.c_uint32),
        ("UnwindInfoAddress", ctypes.c_uint32)
    ]


def main():
    if len(sys.argv) != 2:
        print("Usage: python rip.py <path to executable>")
        return

    with open(sys.argv[1], "rb") as f:
        data = f.read()

    pe: pefile.PE = pefile.PE(data=data)

    # Get start of .pdata section
    # (RVAs are used as file offsets, which works because they coincide in this file)
    pdata_start = None
    for section in pe.sections:
        if section.Name.decode().strip("\x00") == ".pdata":
            pdata_start = section.VirtualAddress

            # Zero out the section
            # There are a lot of fields so I'm zeroing all of them
            section.Name = b"\x00" * 8
            section.Characteristics = 0
            section.PointerToRawData = 0
            section.PointerToRawData_adj = 0
            section.VirtualAddress = 0
            section.VirtualAddress_adj = 0
            section.SizeOfRawData = 0
            section.Misc_VirtualSize = 0
            section.Misc = 0
            section.Misc_PhysicalAddress = 0
            break
    if pdata_start is None:
        print("No .pdata section found")
        return

    pe.FILE_HEADER.NumberOfSections = 1

    # Remove (zero) the exception directory
    exception_directory = pe.OPTIONAL_HEADER.DATA_DIRECTORY[pefile.DIRECTORY_ENTRY["IMAGE_DIRECTORY_ENTRY_EXCEPTION"]]
    exception_directory.VirtualAddress = 0
    exception_directory.Size = 0

    # Get the start of the unwind data
    runtime_func = RUNTIME_FUNCTION.from_buffer_copy(data[pdata_start:])

    # Strip off alignment bytes
    sz = runtime_func.UnwindInfoAddress
    while data[sz - 1] == 0xCC:
        sz -= 1
    text_len = sz - pe.sections[0].VirtualAddress

    # Set the new .text section size
    pe.sections[0].SizeOfRawData = text_len
    pe.sections[0].Misc_VirtualSize = text_len
    pe.sections[0].Misc = text_len
    pe.sections[0].Misc_PhysicalAddress = text_len

    # Update image size
    pe.OPTIONAL_HEADER.SizeOfImage = sz

    # Acquire newly updated data
    data = pe.write()

    # Strip everything after the start of the unwind data
    # (which should hopefully be at the end of the .text section)
    newdata = data[:runtime_func.UnwindInfoAddress]
    # Strip off alignment bytes (again)
    while newdata[-1] == 0xCC:
        newdata = newdata[:-1]

    print(f"Size of new .exe = {len(newdata):#X} ({len(newdata)})")
    with open(f"{sys.argv[1].removesuffix('.exe')}_new.exe", "wb") as f:
        f.write(newdata)


if __name__ == '__main__':
    main()
This produces a 909 byte executable. And there is no trace of anything that isn’t .text.
We can still shave off a few more bytes. But, here’s where things get very volatile.
Remember how we removed the .pdata section header? Well, we didn’t remove it. We just zeroed out its bytes. There are still 40 bytes there that we might be able to remove (+ any extra if the 16-byte alignment matches up).
However, if we actually remove any bytes before the executable section, we might run into a couple of issues.
The section RVA and raw address change. Both are stored in the .text section header and need to be reduced by the number of bytes removed.
The entrypoint RVA changes. It is stored in the optional header. Subtract the same amount from it too.
The size of the image changes again, so that needs updating too.
Everything might explode. If there are any instructions that perform absolute addressing, you’re fucked. Okay maybe not entirely if you disassemble the program and fix the absolute addresses to point to the subtracted position, but that’s a lot of effort. x64 has RIP-relative addressing, which may be used instead of absolute addressing. Fortunately, I wrote this program in a manner so that there is no absolute addressing.
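The bookkeeping in that list can be sketched as plain arithmetic. This is a toy model in Python, not the real patcher: the header bytes and every field value below are hypothetical, picked only to make the numbers easy to follow. Note how only 32 of the 40 zeroed bytes come off in this made-up layout, because the new header size has to be rounded up to 16.

```python
def round_up(value: int, alignment: int) -> int:
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def plan_header_trim(header: bytes, alignment: int = 16):
    """Return (new_header_size, shift) after dropping trailing zero bytes."""
    used = len(header.rstrip(b"\x00"))
    new_size = round_up(used, alignment)
    return new_size, len(header) - new_size


def apply_shift(fields: dict, new_size: int, shift: int) -> dict:
    """Every offset that points past the header moves down by shift."""
    updated = {name: value - shift for name, value in fields.items()}
    updated["SizeOfHeaders"] = new_size
    return updated


# Hypothetical header: 0x1A8 bytes in use, then one zeroed 40-byte section header.
header = b"\x01" * 0x1A8 + b"\x00" * 40
new_size, shift = plan_header_trim(header)

# Hypothetical field values, chosen only to make the arithmetic visible.
fields = {
    "SizeOfHeaders": 0x1D0,
    "SizeOfImage": 0x38D,
    "AddressOfEntryPoint": 0x2F0,
    "VirtualAddress": 0x1D0,
    "PointerToRawData": 0x1D0,
}
updated = apply_shift(fields, new_size, shift)
print(f"shift = {shift}")
for name, value in updated.items():
    print(f"{name}: {fields[name]:#x} -> {value:#x}")
```

None of this touches the code bytes themselves, which is exactly why absolute addresses are the part that can blow up.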
Fortunately, this program is basic enough to pull off some more shaving. Here’s that Python program again.
import pefile
import ctypes
import sys


class RUNTIME_FUNCTION(ctypes.Structure):
    # c_uint32 keeps each field at 4 bytes on every platform
    _fields_ = [
        ("BeginAddress", ctypes.c_uint32),
        ("EndAddress", ctypes.c_uint32),
        ("UnwindInfoAddress", ctypes.c_uint32)
    ]


def main():
    if len(sys.argv) != 2:
        print("Usage: python rip.py <path to executable>")
        return

    with open(sys.argv[1], "rb") as f:
        data = f.read()

    pe: pefile.PE = pefile.PE(data=data)

    # Organize the code we just wrote
    newdata = first_pass(pe, data)
    pe = pefile.PE(data=newdata)
    newdata = second_pass(pe, newdata)

    print(f"Size of new .exe = {len(newdata):#X} ({len(newdata)})")
    with open(f"{sys.argv[1].removesuffix('.exe')}_new.exe", "wb") as f:
        f.write(newdata)


def first_pass(pe: pefile.PE, data):
    # Get start of .pdata section
    # (RVAs are used as file offsets, which works because they coincide in this file)
    pdata_start = None
    for section in pe.sections:
        if section.Name.decode().strip("\x00") == ".pdata":
            pdata_start = section.VirtualAddress

            # Zero out the section
            # There are a lot of fields so I'm zeroing all of them
            section.Name = b"\x00" * 8
            section.Characteristics = 0
            section.PointerToRawData = 0
            section.PointerToRawData_adj = 0
            section.VirtualAddress = 0
            section.VirtualAddress_adj = 0
            section.SizeOfRawData = 0
            section.Misc_VirtualSize = 0
            section.Misc = 0
            section.Misc_PhysicalAddress = 0
            break
    if pdata_start is None:
        raise SystemExit("No .pdata section found")

    pe.FILE_HEADER.NumberOfSections = 1

    # Remove (zero) the exception directory
    exception_directory = pe.OPTIONAL_HEADER.DATA_DIRECTORY[
        pefile.DIRECTORY_ENTRY["IMAGE_DIRECTORY_ENTRY_EXCEPTION"]]
    exception_directory.VirtualAddress = 0
    exception_directory.Size = 0

    # Get the start of the unwind data
    runtime_func = RUNTIME_FUNCTION.from_buffer_copy(data[pdata_start:])

    # Strip off alignment bytes
    sz = runtime_func.UnwindInfoAddress
    while data[sz - 1] == 0xCC:
        sz -= 1
    text_len = sz - pe.sections[0].VirtualAddress

    # Set the new .text section size
    pe.sections[0].SizeOfRawData = text_len
    pe.sections[0].Misc_VirtualSize = text_len
    pe.sections[0].Misc = text_len
    pe.sections[0].Misc_PhysicalAddress = text_len

    # Update image size
    pe.OPTIONAL_HEADER.SizeOfImage = sz

    # Acquire newly updated data
    data = pe.write()

    # Strip everything after the start of the unwind data
    # (which should hopefully be at the end of the .text section)
    newdata = data[:runtime_func.UnwindInfoAddress]
    # Strip off alignment bytes (again)
    while newdata[-1] == 0xCC:
        newdata = newdata[:-1]
    return newdata


def second_pass(pe: pefile.PE, data):
    # First, update the data in the pe object
    # Get the end of the header
    size_of_header = pe.OPTIONAL_HEADER.SizeOfHeaders

    # Fortunately! The last byte(s) of the IMAGE_SECTION_HEADER is the Characteristics field
    # In .text, the last byte is 0x60
    # Therefore, we can just go backwards until we hit it (or anything that isn't 0x00)
    shift = 0
    while data[size_of_header - 1] == 0x00:
        size_of_header -= 1
        shift += 1

    # Assure that this is aligned to 16 bytes
    while size_of_header % 16 != 0:
        size_of_header += 1
        shift -= 1

    # Update the header size, image size (again), and entrypoint
    pe.OPTIONAL_HEADER.SizeOfHeaders = size_of_header
    pe.OPTIONAL_HEADER.SizeOfImage -= shift
    pe.OPTIONAL_HEADER.AddressOfEntryPoint -= shift

    # Update the section header
    pe.sections[0].PointerToRawData -= shift
    pe.sections[0].PointerToRawData_adj -= shift
    pe.sections[0].VirtualAddress -= shift
    pe.sections[0].VirtualAddress_adj -= shift

    newdata = pe.write()

    # The data is still there, so we trim it
    newdata = newdata[:size_of_header] + newdata[size_of_header + shift:]
    return newdata


if __name__ == '__main__':
    main()
Believe it or not, but I got this working first try.
877.
We’ve focused a lot on post-compilation, but what about the actual code we wrote. Is there a way to optimize that any further? Let’s start golfing.
What if, instead of walking the PEB, we just invoked the underlying system call directly? That’s what NtWriteFile does under the hood, so let’s just do that instead. Unfortunately, we have to write this by hand in assembly. Oh well.
To invoke a syscall, you just MOV EAX, syscall_number and then SYSCALL. Of course, you have to set up your parameters beforehand.
The syscall number for NtWriteFile might change between Windows versions, but on mine, it’s 8.
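If you want to check the number on your own machine, one option is to look at the first bytes of the NtWriteFile stub in ntdll. Here is a small Python sketch that decodes such a stub. It assumes the usual unhooked x64 layout, where the stub opens with `mov r10, rcx` followed by `mov eax, imm32`. The helper name `syscall_number` is made up, and the sample bytes are typed in by hand rather than read from a real ntdll.

```python
import struct

# Encoding of: mov r10, rcx ; mov eax, imm32 (the imm32 follows the prefix)
STUB_PREFIX = bytes.fromhex("4c8bd1b8")


def syscall_number(stub: bytes):
    """Return the imm32 loaded into EAX, or None if the stub looks different."""
    if len(stub) < 8 or not stub.startswith(STUB_PREFIX):
        return None
    return struct.unpack_from("<I", stub, 4)[0]


# Hand-written sample bytes for a stub that loads 8 into EAX.
sample = bytes.fromhex("4c8bd1b808000000")
print(syscall_number(sample))  # 8
```

A stub that has been patched (for example one that starts with a jump) will not match the prefix, so the function returns None instead of guessing.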
If you’re following along, you’ll want to follow these instructions.
Here is our new .asm file.
PUBLIC HWWriteFile
.code
HWWriteFile PROC
mov r10, rcx
mov eax, 8 ; Number for NtWriteFile
syscall
ret
HWWriteFile ENDP
END
And the updated main.cpp
--snip--
extern "C" NTSTATUS NTAPI HWWriteFile(HANDLE, HANDLE, PVOID, PVOID, PVOID, PVOID, ULONG, PLARGE_INTEGER, PVOID);
int mainCRTStartup(PPEB peb)
{
    struct R_RTL_USER_PROCESS_PARAMETERS *params = (struct R_RTL_USER_PROCESS_PARAMETERS *)peb->ProcessParameters;
    HANDLE stdout = params->StandardOutput;
    IO_STATUS_BLOCK iosb;
    const char hello[] = {'H', 'e', 'l', 'l', 'o', ' ', 'W', 'o', 'r', 'l', 'd', '!', '\n'};
    // Call the hand-written syscall stub instead of resolving NtWriteFile through the PEB
    HWWriteFile(stdout, NULL, NULL, NULL, &iosb, (PVOID)hello, 13, NULL, NULL);
    return 0;
}
Compile the program and run the Python script again and…
667
This seems like a good stopping point. To be honest, in the wild, no one writes complex programs like this in assembly. At least I haven’t seen any that do.
You could write the entire program in assembly yourself to perform a few more optimizations, but that’s overkill.
And by overkill I mean that I tried but the syscall kept failing. Oh well :)
I got it down to 600 bytes, but that was the best I could do.
Ignoring that failure, that’s all for this post. Until next time.
Go!
-BowTiedCrawfish