Malware Analysis

We have so far covered C and Assembly, vulnerabilities that can be introduced, reverse engineering, the application of low-level techniques to find exploits, and buffer overflows in particular.

We can perform analysis of programs in different ways (statically and dynamically) to help find security issues or malware. We have already seen how to exploit buffer overflows, and how running a program under gdb can shift its memory layout slightly.

Code Review

Trying to overflow buffers by hand shows us that it's not trivial to do. When we write secure software, we want something that helps us find these bugs. One such aid is automated code review, which tries to find bugs in programs automatically through static and dynamic analysis.

Static Analysis

This is the process of seeing what we can learn from a program without directly running it. It lets us verify software, and find the defects and problems within it, purely from its code.

Static analysis can help us find flow issues (e.g., infinite loops, unreachable code), security issues (e.g., buffer overflows or input validation errors), memory issues (e.g., uninitialized segments, null dereferences), input issues such as SQL injection and XSS, resource leaks, thread issues, and exceptions.

It is very useful for checking against pre-defined rules (e.g., make sure that variables are always initialized after declaration before they are read). Static analysis doesn't solve every issue, however: thanks to the halting problem, we know that non-trivial properties of programs are, in general, undecidable (thanks, Dr. Turing, for pointing this one out).
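As a minimal sketch of the kind of code such a rule would flag (the program below is made up purely for illustration):

#include <stdio.h>

int main(void) {
  int total;                 /* declared but never initialised */
  printf("%d\n", total);     /* read before any write: the rule flags this */
  return 0;
}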

Static analysis tools can be split into the following types: source code, bytecode, and binary code. Source code analysis performs pattern matching on the code, looking for known bad patterns; the other two do the same on their respective representations of the code.

We might need to do binary code analysis in some cases, e.g., when writing a compiler.

Example: sudo

If we try to manually audit the sudo program, we very quickly run into issues. Looking at the assembly, we have a huge number of instructions, which quickly becomes tedious to analyse.

However, if we take a step back and abstract the problem, we can look at higher-level structure instead. For example, we can look at the strings contained within the program with strings, the dynamically linked function calls with objdump -T /usr/bin/sudo, the overall layout of the program, the libraries it uses, the compiler used to build it, etc.

All of this is possible on just the binary, but if we have access to the source code (oh, look! We do! Thanks open source!), then we can perform more analysis, by taking the source code and making it into an intermediate representation (e.g., LLVM IR).

How Static Analysis Tools Work

We first get the source code, executable, or bytecode of the program. We convert it into a universal intermediate representation, then using a set of rules, we pass it into an analyser. We can then see the results of this analysis in the outputs.

Static analysis tools can run alongside language servers on the developer's machine to flag rule violations as code is written; security teams in an organisation can then enforce the same rules at the source level, rejecting commits that contain common security vulnerabilities.

Intermediate Representations

We can store the code either as an abstract syntax tree, encoding how statements and expressions are nested to produce programs, or as a control flow graph, describing the order in which statements are executed and the conditions under which each path is taken.

ASTs

Abstract syntax trees keep the logical elements of the source code, leaving the rest (e.g., comments, punctuation, formatting) out. Leaf nodes represent operands (e.g., variables or constants), with the inner nodes denoting operators (e.g., addition or assignment).
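As a small example, for the statement x = a + b; the root of the AST is an assignment node whose left child is the leaf x and whose right child is a + node, which in turn has the leaves a and b as its operands.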

CFGs

These are directed graphs in which each node represents a statement (e.g., an assignment or a branch) and each edge represents the flow of control between statements.
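As a sketch (the function and variable names here are invented for illustration):

int sign_plus_one(int x) {
  int y;
  if (x > 0) {
    y = 1;
  } else {
    y = -1;
  }
  return y + 1;
}

The CFG for this has a node for the condition x > 0 with two outgoing edges: the true edge leads to the node for y = 1 and the false edge to the node for y = -1. Both of those nodes then have an edge to the node for return y + 1, where the two paths join back up.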

Analysis Techniques

We can do lexical analysis, which works on the tokenised source code (or its AST representation), data flow analysis, which uses the CFG, or control flow analysis, which also makes use of the CFG.

Lexical analysis looks for matches against known vulnerable functions. These checks are fast and cheap to implement, and are good for finding deprecated functions still in use. However, they aren't aware of the context in which a function is called. For example, they might warn against using strcpy without checking whether we perform a size check before copying.
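A sketch of why this matters (both functions here are made up for illustration): a purely lexical checker flags every use of strcpy, including one that is guarded by a length check.

#include <string.h>

void copy_unchecked(char *dst, const char *src) {
  strcpy(dst, src);          /* flagged, and genuinely dangerous */
}

void copy_checked(char *dst, size_t dstlen, const char *src) {
  if (strlen(src) < dstlen) {
    strcpy(dst, src);        /* flagged too, despite the size check */
  }
}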

Data Flow: Taint Propagation

Any input from an untrusted user is known as tainted. We can then warn against sending data to certain methods and constructs (sinks). If we then have a function that validates and sanitizes the data, we can call this an untainting operation, which then marks the data as clean again.

We can follow the data flows through the program and determine whether tainted data can reach a vulnerable sink function. We have a source rule, a passthrough rule, and a sink rule: the source rule marks a variable as tainted, the passthrough rule shows which of a function's variables end up tainted when tainted data passes through it, and the sink rule marks a function as somewhere that must not receive unsanitised (tainted) data.
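As a sketch of how these rules apply to real code (the command being built here is invented for illustration):

#include <stdio.h>
#include <stdlib.h>

int main(void) {
  char name[64];
  char cmd[128];

  /* Source rule: data read from the user is marked as tainted. */
  if (fgets(name, sizeof(name), stdin) == NULL)
    return 1;

  /* Passthrough rule: snprintf copies the taint from name into cmd. */
  snprintf(cmd, sizeof(cmd), "ls /home/%s", name);

  /* Sink rule: system() must not receive tainted data; a sanitising
     (untainting) function would need to run on name first. */
  system(cmd);
  return 0;
}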

Control Flow

We can look at the order in which a program's operations can happen, and check, for example, that the same memory isn't freed twice (a double free) or used after it has been freed (use after free).
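A deliberately broken sketch showing both problems in one place:

#include <stdlib.h>
#include <string.h>

int main(void) {
  char *buf = malloc(32);
  if (buf == NULL)
    return 1;
  strcpy(buf, "hello");
  free(buf);
  strcpy(buf, "again");   /* use after free: buf no longer points to valid memory */
  free(buf);              /* double free: the same pointer is freed a second time */
  return 0;
}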

Tools

There are many open source tools we can use to find vulnerabilities within programs, including FlawFinder, RATS, Clang's built-in static analyser, and FindBugs. In addition to these free tools, there are many commercial tools, which are likely to be kept more up to date, including Fortify, Coverity, and CodeSecure.

Limitations

The issue with static analysis is that it can easily produce false positives, which then require manual review. Tooling can also produce false negatives, which are real bugs that go unreported. We say that a tool is sound if it produces zero false negatives, and unsound if it trades this away, reducing false positives but letting some false negatives through.

They can also lead to a false sense of security, as 'if static analysis says it's good, surely it's good right' could quickly become a problem. This is why manual code review and security researchers exist, to catch issues that we haven't yet trained computers to reliably catch themselves.

Static analysis tools also don't understand what an application is supposed to do, and the normal rules are for general security defects. Applications can also still have other issues, e.g., with authorization and trust. It has been said that static analysis only covers 50% of security defects.

Key Characteristics

We want static analysis tools to support multiple languages, be extensible, be useful for security analysts and developers, and support existing development processes.

Dynamic Analysis

This is a process to establish what a program does whilst it is running. We can achieve this by looking at the program state during its execution. This is one approach for verifying software, as we execute on specific inputs and see what the results are. E.g., functional testing, web application scanners, fuzzing, etc.

Detection of the vulnerabilities is integrated into the execution of the program through tooling known as instrumentation and sanitisers.

What to Find?

We can use dynamic analysis for runtime error detection, memory issues and leaks, I/O validation issues, pointer arithmetic issues, response time issues, functional issues, and the performance of the program. These tools are limited to the code that is actually executed at runtime.

Dynamic analysis is often used to check test coverage, performance, memory usage, security properties, concurrency errors, and violations of rules. We run the program to gather information, collect the information, then analyse it.

Instrumentation

We add extra code to the program to aid with the runtime behaviour observations. This extra code is instrumentation code, which then calls external analysis code to collect data about the execution of an instruction or about its behaviour. Instrumentation can be injected at compile time (preferred), or can be done on compiled code.
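A hand-written sketch of the idea (real tools inject this kind of code for us; the TRACED_MALLOC wrapper below is made up for illustration):

#include <stdio.h>
#include <stdlib.h>

/* Analysis code: collects data about each allocation event. */
static void *traced_malloc(size_t size, const char *file, int line) {
  fprintf(stderr, "[instr] %s:%d allocating %zu bytes\n", file, line, size);
  return malloc(size);
}

/* Instrumentation: allocations go through the instrumented wrapper
   instead of calling malloc directly. */
#define TRACED_MALLOC(sz) traced_malloc((sz), __FILE__, __LINE__)

int main(void) {
  int *xs = TRACED_MALLOC(10 * sizeof(int));
  free(xs);
  return 0;
}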

Binary instrumentation has its own advantages, though: it is language independent, requires no recompilation or access to source, and covers all the code in the build.

Tools

At compile time, we can make use of gprof to profile the code, gcov to get coverage analysis of the program, and dmalloc to replace normal memory allocation with a safer allocator that tracks memory leaks, out-of-bounds writes, and so on. More commonly, these checks are becoming part of compilers themselves and are enabled for debug builds (some sanitisers slow the code down too much for production builds).
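For example, GCC and Clang ship AddressSanitizer, which can be enabled for a debug build with the -fsanitize=address compiler flag and reports out-of-bounds accesses and use-after-free errors at runtime.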

Valgrind

This is a tool suite for debugging and profiling applications. Tools such as memcheck, which is distributed as part of the suite, detect many memory-related errors in C and C++ programs that can cause crashes or lead to unpredictable behaviour.

Valgrind can be run easily to check for leaks with valgrind --leak-check=yes ./binary.

Example: Memory Leak

Take a toy program:

#include <stdlib.h>

void f(void) {
  int* x = malloc(10*sizeof(int));
  x[10] = 0;              /* out-of-bounds write: valid indexes are 0-9 */
}                         /* x goes out of scope without free(): memory leak */

int main(void) {
  f();
  return 0;
}

When we run this under Valgrind, it tells us off for never freeing the memory allocated in f, and for writing to index 10 when only indexes 0-9 are available to us. The program can be fixed as follows:

#include <stdlib.h>

void f(void) {
  int* x = malloc(11*sizeof(int));   /* allocate room for index 10 */
  x[10] = 0;
  free(x);                           /* release the block before returning */
}

int main(void) {
  f();
  return 0;
}

Pin

This is a binary instrumentation framework from Intel. It allows us to build Pintools for Windows and Linux platforms. These are written in C/C++, and can be used to monitor and record the behaviour of a program whilst it is running. A Pintool has instrumentation, analysis, and callback routines: instrumentation routines decide where analysis code is inserted, analysis routines collect the data at those points, and callback routines are called when specific conditions are met (e.g., when the program exits).

Advantages and Disadvantages

Dynamic analysis doesn't produce false positives or false negatives, because it tracks the actual state of the running program. We don't need the source code for the tools to run, and we can also see the path taken to reach a crash. Dynamic analysis can also run on live code, which is important for security.

However, dynamic analysis can only detect vulnerabilities for the specific inputs that are run, and the analysis itself requires a lot of computation. As it works on the in-memory, running program, we also can't as easily point to the exact location of a vulnerability in the source code.

Other Dynamic Tools

We also have network scanners, sniffers, and vulnerability scanners; web application scanners; IDSes; firewalls; and debuggers, all of which are dynamic tools.