
1.1 The Software Pipeline

When you look at a computer application, a game, or a script, you are looking at the final product of a digital assembly line. To a computer, code isn’t a collection of words—it is a series of microscopic electrical switches turning on and off.

As a reverse engineer, your job is to look at that finished product and figure out how it was built. To do that, you first need to understand The Software Pipeline: the journey code takes from human-readable text into machine action.

We will use Python, a popular and readable programming language, to see how this works step by step.


Before diving into the pipeline, it helps to understand the two languages being spoken:

  1. High-Level Language (Human-Friendly): This is code written in plain text using English words and simple math (like Python). It is easy for us to write, read, and design, but a computer chip cannot execute it directly.
  2. Low-Level Language (Machine-Friendly): This is the native tongue of your computer’s processor (CPU). It consists entirely of raw data, binary states, and basic hardware instructions. It is incredibly difficult for humans to read, but it runs at lightning speed.

The software pipeline is the bridge that translates high-level text into low-level instructions.

Let’s look at a basic Python script. Do not worry about memorizing the syntax; just focus on the intent. This code takes a radius, calculates the area of a circle, and prints it out:

main.py
FACTOR = 2

def calculate_area(radius):
    result = 3.14159 * (radius ** FACTOR)
    print(f"Calculated Area: {result}")
    return result

calculate_area(5)

Here is exactly how the computer processes this text file from start to finish.


Stage 1: Reading and Parsing (Tokenization)


When you tell your computer to run main.py, the Python interpreter opens the file. But a computer can’t understand a whole paragraph at once. It has to read it character by character, just like you read a sentence word by word.

First, the engine breaks the long string of text into individual units called Tokens. Think of this like taking a sentence and separating it into nouns, verbs, and punctuation.

  • It identifies def as a keyword meaning “a function is starting.”
  • It identifies calculate_area as a unique name.
  • It identifies * and ** as math instructions.
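To see this step for yourself, Python's standard library includes a tokenize module that exposes the same kind of token stream. The sketch below is a minimal illustration: it feeds in only the formula line from main.py as a standalone string, and the variable names (source, tokens, pairs) are just for this example.

```python
import io
import tokenize

# The formula line from main.py, used here as a standalone snippet.
source = "result = 3.14159 * (radius ** FACTOR)\n"

# generate_tokens expects a readline-style callable that returns text lines.
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

for tok in tokens:
    # tok.type is a numeric code; tok_name maps it to a readable label.
    print(f"{tokenize.tok_name[tok.type]:<10} {tok.string!r}")

# Collect (kind, text) pairs so individual tokens are easy to look up.
pairs = [(tokenize.tok_name[t.type], t.string) for t in tokens]
```

Each printed row is one token: names such as radius are labelled NAME, the number is labelled NUMBER, and symbols such as * and ** are labelled OP.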

Next, it arranges these tokens into a structural blueprint called an Abstract Syntax Tree (AST). This is exactly like sentence diagramming from school. It checks if your code follows proper grammatical rules. If you forgot a colon or misspelled a core command, the blueprint fails here, and you get a SyntaxError.
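As a small illustration of the grammar check, the sketch below hands Python a deliberately broken function header (the colon is missing) and catches the resulting error. The broken_source string is invented for this example; ast.parse is the standard-library function that runs the parsing step without executing anything.

```python
import ast

# A deliberately broken version of the function header: the colon is missing.
broken_source = "def calculate_area(radius)\n    return radius\n"

try:
    ast.parse(broken_source)
    parsed_ok = True
except SyntaxError as err:
    parsed_ok = False
    # err.lineno is the line the parser reported; err.msg is its explanation.
    print(f"SyntaxError on line {err.lineno}: {err.msg}")
```

Note that the error is raised before any of the code runs, which is why a single missing colon stops the whole file.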

If we peek behind the scenes to see how Python organizes our mathematical formula into a logical tree structure, it looks like this:

# Simplified view of the code's structural blueprint
Assignment:
├── Target: FACTOR
└── Value: 2
Function Definition:
├── Name: calculate_area
├── Input: radius
└── Math Operation: Multiply
    ├── Left Side: 3.14159
    └── Right Side: Power of (radius to the power of FACTOR)
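The simplified tree above can be compared with the real one using the standard-library ast module. This is a sketch rather than a full tour: it parses the main.py source (without the final call), then walks down to the formula by position, which assumes the formula is the first statement inside the function.

```python
import ast

source = '''
FACTOR = 2
def calculate_area(radius):
    result = 3.14159 * (radius ** FACTOR)
    print(f"Calculated Area: {result}")
    return result
'''

tree = ast.parse(source)

# The module body holds the top-level statements in order.
top_level = [type(node).__name__ for node in tree.body]
print(top_level)

# Drill into the function and pick out the formula's expression node.
func = tree.body[1]
formula = func.body[0].value                # the whole 3.14159 * (...) expression
outer_op = type(formula.op).__name__        # operator of the outer expression
inner_op = type(formula.right.op).__name__  # operator inside the parentheses
print(outer_op, inner_op)

# ast.dump prints the full, unsimplified subtree (indent= needs Python 3.9+).
print(ast.dump(func.body[0], indent=2))
```

The real tree uses node names such as Assign, FunctionDef, and BinOp, but the shape matches the simplified view: a multiplication whose right-hand side is a power operation.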

Stage 2: The Secret Translation (Bytecode)


Once the computer verifies that your code’s grammar is perfect, it translates that structural tree into Bytecode.

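Bytecode is a compact, lower-level version of your program. Comments and formatting are discarded, and your logic is broken down into small, simple instructions that do not depend on your operating system or CPU. Your variable and function names are not thrown away: they are stored in tables next to the instructions, which is a big part of why Python bytecode is comparatively easy to reverse engineer.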

  • The Cache: To avoid re-translating the same text every time, Python saves the bytecode of imported modules in a folder called __pycache__, next to the source file. These files end in .pyc. A script you run directly, such as main.py here, is compiled in memory and is not cached unless you compile it explicitly.
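To see where a cached file would end up, the sketch below writes a throwaway main.py into a temporary folder and compiles it explicitly with the standard-library py_compile module. The one-line file content and the variable names are assumptions made for this illustration; Python normally writes these files on its own when a module is imported.

```python
import importlib.util
import os
import py_compile
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    source_path = os.path.join(workdir, "main.py")
    with open(source_path, "w", encoding="utf-8") as f:
        f.write("FACTOR = 2\n")

    # Where Python would cache this module's bytecode if it were imported.
    expected_cache = importlib.util.cache_from_source(source_path)

    # py_compile writes the .pyc explicitly, without importing or running the file.
    written = py_compile.compile(source_path, doraise=True)

    cache_dir_name = os.path.basename(os.path.dirname(written))
    cache_exists = os.path.exists(written)
    print(written)
```

The printed path includes a tag for the interpreter version, because bytecode produced by one Python version is not guaranteed to load in another.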

If you open a .pyc file in a regular text editor, you will see mostly unreadable binary data. The file is not corrupted; it is simply not text. Reverse engineers use tools called disassemblers to turn those raw bytes back into readable instruction names called mnemonics.
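One way to turn a .pyc back into readable steps uses only the standard library. The sketch below assumes Python 3.7 or later, where the file starts with a 16-byte header followed by a marshalled code object, and it assumes the file is read by the same Python version that wrote it. The two-line sample program is made up for the example.

```python
import dis
import importlib.util
import marshal
import os
import py_compile
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    source_path = os.path.join(workdir, "main.py")
    with open(source_path, "w", encoding="utf-8") as f:
        f.write("FACTOR = 2\nprint(FACTOR * 21)\n")
    pyc_path = py_compile.compile(source_path, doraise=True)
    with open(pyc_path, "rb") as f:
        raw = f.read()

# Header: magic number, flags, then source timestamp and size (or a hash).
header, payload = raw[:16], raw[16:]

# The magic number identifies which Python version produced the file.
magic_matches = header[:4] == importlib.util.MAGIC_NUMBER

# Only unmarshal files you trust or are analysing in an isolated environment.
code = marshal.loads(payload)

# Print the mnemonics for the module-level code.
dis.dis(code)
```

Dedicated decompilers go further and try to rebuild source text, but this header-plus-marshal layout is what they all start from.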

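Here is what the math step (3.14159 * (radius ** FACTOR)) looks like when translated into Python bytecode. The instruction names shown are from Python 3.11 and later; older versions use separate names such as BINARY_POWER and BINARY_MULTIPLY instead of BINARY_OP: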

# The low-level instructions for our formula
1. LOAD_CONST (Load the number 3.14159 onto the work desk, known as the stack)
2. LOAD_FAST (Grab the 'radius' value passed in by the caller)
3. LOAD_GLOBAL (Grab the 'FACTOR' value we set earlier)
4. BINARY_OP (Apply the power-of '**' operation to radius and FACTOR)
5. BINARY_OP (Multiply '*' the result with 3.14159)
6. STORE_FAST (Save the final answer into the local variable 'result')
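You can produce a listing like this yourself with the standard-library dis module. The sketch below uses a trimmed copy of calculate_area (the print call is left out to keep the output short). Exact instruction names vary between Python versions, so expect small differences from the listing above.

```python
import dis

FACTOR = 2

def calculate_area(radius):
    # Trimmed copy of the function from main.py (no print call).
    result = 3.14159 * (radius ** FACTOR)
    return result

# Print the human-readable listing for the function body.
dis.dis(calculate_area)

# Collect just the instruction names, in the order they appear.
opnames = [ins.opname for ins in dis.get_instructions(calculate_area)]
print(opnames)
```

The function is never called by dis; disassembly only reads the bytecode that was produced when the def statement was compiled.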

The instructions are now prepared, but they are not yet running on your physical computer chip. They are written for a software-based engine called the Python Virtual Machine (PVM).

Think of the PVM like a gaming console emulator running on your PC. The bytecode acts like the game cartridge—it holds instructions designed specifically for that virtual system.

The PVM acts as the ultimate supervisor:

  1. It reads the bytecode instructions one instruction at a time.
  2. It manages a temporary calculation space (called a stack) to handle values.
  3. When an instruction needs the outside world, such as showing text on your screen or saving data to disk, it passes the request to your operating system (Windows, macOS, or Linux), which drives the physical hardware.
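The read-and-stack loop described above can be sketched as a toy interpreter. This is not how the real PVM is implemented (that is written in C and handles many more instructions); the run function, the tuple-based program format, and the use of operator symbols as arguments are all simplifications invented for this example.

```python
def run(program, local_vars, global_vars):
    """Execute a list of (opcode, argument) pairs on a tiny stack machine."""
    stack = []
    for opcode, arg in program:
        if opcode == "LOAD_CONST":
            stack.append(arg)                # push a literal value
        elif opcode == "LOAD_FAST":
            stack.append(local_vars[arg])    # push a local variable
        elif opcode == "LOAD_GLOBAL":
            stack.append(global_vars[arg])   # push a global variable
        elif opcode == "BINARY_OP":
            right = stack.pop()              # top of stack is the right operand
            left = stack.pop()
            if arg == "**":
                stack.append(left ** right)
            elif arg == "*":
                stack.append(left * right)
            else:
                raise ValueError(f"unsupported operator: {arg}")
        elif opcode == "STORE_FAST":
            local_vars[arg] = stack.pop()    # pop the result into a local
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return local_vars

# The six steps of the formula, written for the toy machine.
program = [
    ("LOAD_CONST", 3.14159),
    ("LOAD_FAST", "radius"),
    ("LOAD_GLOBAL", "FACTOR"),
    ("BINARY_OP", "**"),
    ("BINARY_OP", "*"),
    ("STORE_FAST", "result"),
]

final_locals = run(program, {"radius": 5}, {"FACTOR": 2})
print(final_locals["result"])
```

Tracing the stack by hand is a useful exercise: after the third step it holds 3.14159, 5, and 2, and each BINARY_OP replaces the top two values with one result.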

Why did we learn this instead of just writing code? Because reverse engineering is this entire pipeline run in reverse.

The Developer's Path: Idea ──> Source Code (.py) ──> Bytecode (.pyc) ──> Execution
The Researcher's Path: Unknown App ──> Bytecode (.pyc) ──> Analysis ──> Reconstruct Logic

When you look at malware, audit a closed-source application, or look at a compiled asset, you rarely have access to the original main.py source file. Usually, you only find the raw bytecode or binaries.

By understanding this pipeline, you know that you don’t have to guess what an app does. You can use tools to capture the intermediate bytecode, read the individual execution steps, and reconstruct the program's logic, often recovering something very close to the original source (minus the comments).
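As a first taste of that reconstruction work, the sketch below compiles the main.py source in memory and inspects the resulting code object, which is the same kind of object a .pyc file stores. It assumes you already have the code object in hand; the lookup by function name is a convenience for this example, not a general-purpose technique.

```python
import types

source = '''
FACTOR = 2
def calculate_area(radius):
    result = 3.14159 * (radius ** FACTOR)
    print(f"Calculated Area: {result}")
    return result
'''

# compile() produces a module-level code object without running anything.
module_code = compile(source, "main.py", "exec")

# Function bodies are nested code objects stored among the module's constants.
func_code = next(
    c for c in module_code.co_consts
    if isinstance(c, types.CodeType) and c.co_name == "calculate_area"
)

# These tables travel with the bytecode and survive compilation.
print("arguments and locals:", func_code.co_varnames)
print("global names used:   ", func_code.co_names)
print("constants:           ", func_code.co_consts)
```

Even before reading a single instruction, these tables reveal the function's name, its parameter, the globals it touches, and the literal values it uses, which is often enough to guess its purpose.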