The Definitive Guide to PDF Forensics & Malware Analysis
Understanding the PDF Threat Landscape
PDF files have evolved from simple documents into attack vectors.
1. Malicious JavaScript Execution
PDFs can contain embedded JavaScript that executes automatically when the document is opened. This is legitimate for tasks such as form validation, but attackers use it for:
- Heap Spraying: Manipulating memory to prepare for an exploit.
- Payload Execution: Downloading binaries from remote servers.
- Reader Exploitation: Attacking vulnerabilities in PDF readers such as Adobe Reader or Foxit.
2. Embedded Malware and Droppers
The PDF specification allows embedding various file formats (.exe, .js, .docx). These “Droppers” rely on social engineering to trick users into saving and opening the attached malicious file, often bypassing email gateway scanners that inspect the PDF but miss the embedded object.
3. Business Email Compromise & Forgery
Not all malicious PDFs contain code. Invoice fraud is a massive industry. Attackers intercept legitimate PDF invoices and modify:
- Bank account numbers (IBAN/SWIFT).
- Payment addresses.
- Totals or line items.
Detecting these changes requires forensic analysis of visual inconsistencies and modification metadata rather than malware analysis.
Prerequisites & Tooling
To follow this guide, you will need a Linux environment. The specialized tools mentioned below are pre-installed on the REMnux distribution, or can be downloaded as part of the Didier Stevens Suite.
- Standard Utils: md5sum, file, strings, binwalk
- Didier Stevens Suite: pdfid.py, pdf-parser.py
- Metadata Tools: exiftool, pdfinfo
Step-by-Step Analysis Methodology
Phase 1: Triage and Identification
Before touching the file structure, perform a safety check and identify the file.
1. Calculate Hashes
Always document your evidence.
md5sum suspicious.pdf
sha256sum suspicious.pdf
2. Validate File Magic
Ensure the file is actually a PDF and not an executable renamed to .pdf.
file suspicious.pdf
# Look for: "PDF document, version 1.x"
head -c 20 suspicious.pdf
# Look for header: %PDF-1.x
3. Safety Copy
Never analyze the original evidence file.
cp suspicious.pdf working_copy.pdf
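The triage steps above can also be scripted so that the hashes and the header check land in a single record for your case notes. The following is a minimal sketch, not part of any of the tools named in this guide; the function name `triage` and the returned field names are illustrative assumptions. It reads the file as raw bytes only and never parses or renders it.

```python
import hashlib
from pathlib import Path


def triage(path):
    """Return hashes and a PDF header check for a file, read as raw bytes."""
    data = Path(path).read_bytes()
    # The header normally sits at offset 0. Some readers tolerate leading
    # junk before it, so the sketch searches the first 1024 bytes and
    # reports the offset; anything other than 0 deserves a closer look.
    offset = data[:1024].find(b"%PDF-")
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
        "size": len(data),
        "header_offset": offset,  # -1 means no PDF header was found
        "header": data[offset:offset + 8].decode("latin-1") if offset >= 0 else None,
    }
```

Calling `triage("suspicious.pdf")` returns a dictionary that can be pasted into a report. A `header_offset` of -1 suggests the file is not a PDF at all, whatever its extension says.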
Phase 2: Metadata & History
Metadata can reveal the “story” of the document.
Check timestamps and creator tools:
pdfinfo suspicious.pdf
exiftool suspicious.pdf
- Red Flag: A “CreationDate” that is newer than the “ModDate”.
- Red Flag: A document claiming to be a “Bank of America Invoice” but created with a tool like “PwnPDF” or a generic library like “iText” (often used programmatically).
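The first red flag can be checked mechanically once you have the raw date strings from the document's Info dictionary, which use the `D:YYYYMMDDHHmmSS` form with an optional timezone suffix. The sketch below assumes you already hold those raw strings (for example, copied from a parser dump); the helper names `parse_pdf_date` and `created_after_modified` are hypothetical. It also assumes that a date without a timezone is UTC, which the PDF format does not guarantee.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYYMMDDHHmmSS followed by an optional Z, +HH'mm' or -HH'mm' suffix.
_PDF_DATE = re.compile(
    r"D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([Zz+\-])(?:(\d{2})'?(\d{2})?'?)?)?"
)


def parse_pdf_date(value):
    """Parse a raw PDF date string into an aware datetime, or None."""
    m = _PDF_DATE.match(value.strip())
    if not m:
        return None
    # Missing fields fall back to the earliest valid value.
    defaults = (0, 1, 1, 0, 0, 0)
    year, month, day, hour, minute, second = (
        int(g) if g else d for g, d in zip(m.groups()[:6], defaults)
    )
    sign, tz_hours, tz_minutes = m.group(7), m.group(8), m.group(9)
    offset = timedelta(hours=int(tz_hours or 0), minutes=int(tz_minutes or 0))
    if sign == "-":
        offset = -offset
    # Assumption: no timezone suffix is treated as UTC.
    return datetime(year, month, day, hour, minute, second, tzinfo=timezone(offset))


def created_after_modified(creation_date, mod_date):
    """True when both dates parse and creation is later than modification."""
    created, modified = parse_pdf_date(creation_date), parse_pdf_date(mod_date)
    return created is not None and modified is not None and created > modified
```

Because the comparison is done on timezone-aware values, a creation time of 12:00 at +02:00 is correctly treated as earlier than a modification time of 11:00 UTC.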
Phase 3: Structural Overview (pdfid)
Use pdfid.py to get a high-level view of the PDF’s internal objects without executing any code.
pdfid.py suspicious.pdf
The “Dirty Dozen” Keywords to watch:
| Keyword | Risk | Description |
|---|---|---|
| /JavaScript | High | Indicates script execution capabilities. |
| /JS | High | Abbreviation for JavaScript. |
| /AA / /OpenAction | High | Actions that trigger automatically: /OpenAction when the document opens, /AA on other events such as viewing a page. |
| /Launch | Critical | Can launch external executables. |
| /EmbeddedFile | Medium | Indicates files explicitly attached to the PDF. |
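To make the idea behind this keyword scan concrete, here is a rough sketch of counting the names from the table in a file's raw bytes. It is not how pdfid.py is implemented; `KEYWORDS`, `decode_name_escapes`, and `count_keywords` are illustrative names. The one trick it demonstrates is undoing `#XX` hex escapes, which attackers use to write `/JavaScript` as `/J#61vaScript`. It cannot see keywords that sit inside compressed object streams.

```python
import re

KEYWORDS = [b"/JavaScript", b"/JS", b"/AA", b"/OpenAction", b"/Launch", b"/EmbeddedFile"]


def decode_name_escapes(data):
    """Undo #XX hex escapes in names, e.g. /J#61vaScript -> /JavaScript."""
    return re.sub(
        rb"#([0-9A-Fa-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        data,
    )


def count_keywords(data):
    """Count whole-name occurrences of each keyword in raw PDF bytes."""
    normalized = decode_name_escapes(data)
    counts = {}
    for keyword in KEYWORDS:
        # Require a non-alphanumeric byte after the name so that /JS does
        # not match a longer name that merely starts with /JS.
        pattern = re.escape(keyword) + rb"(?![A-Za-z0-9])"
        counts[keyword.decode()] = len(re.findall(pattern, normalized))
    return counts
```

A typical call is `count_keywords(open("working_copy.pdf", "rb").read())`. Treat a non-zero count as a pointer to where to look next, not as a verdict.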
Phase 4: Deep Dive (pdf-parser)
If pdfid flags suspicious keywords, use pdf-parser.py to examine the specific objects.
1. Search for Specific Keywords
# Locate the object ID for JavaScript elements
pdf-parser.py -s /JavaScript suspicious.pdf
2. Inspect an Object
Once you have an Object ID (e.g., 10), inspect its content:
# View Object 10
pdf-parser.py -o 10 suspicious.pdf
3. Handling Compressed Streams
Malware authors commonly compress malicious streams and scripts with /FlateDecode, which hides them from strings and grep. Use the -f option to pass the stream through its filters and decompress it.
# Dump and decompress Object 10 to stdout
pdf-parser.py -o 10 -f suspicious.pdf
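What /FlateDecode does is plain zlib compression, so a stream that strings cannot read becomes legible after one inflate call. The sketch below illustrates that with a deliberately naive scanner. It assumes a single /FlateDecode filter with no predictor and no filter chain, and it ignores the /Length entry; `inflate_streams` is a hypothetical helper, and pdf-parser.py with -f remains the right tool for real samples.

```python
import re
import zlib

# Match the bytes between "stream" and "endstream". The lookbehind stops
# the "stream" inside "endstream" from being treated as a start marker.
_STREAM = re.compile(rb"(?<!end)stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(data):
    """Yield (index, decompressed_bytes) for each stream zlib can inflate."""
    for index, match in enumerate(_STREAM.finditer(data)):
        try:
            # decompressobj tolerates the end-of-line bytes that trail the
            # compressed data before the endstream keyword.
            inflated = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not zlib data (other filter, or uncompressed)
        if inflated:
            yield index, inflated
```

Iterating over `inflate_streams(open("working_copy.pdf", "rb").read())` and printing the first few hundred bytes of each result is a quick way to spot which stream holds script text.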
Phase 5: Extraction & Deobfuscation
Extracting JavaScript
Attackers often obfuscate JS to hide shellcode.
- Extract the raw JS:
pdf-parser.py -o [object_id] -f -w suspicious.pdf > malicious.js
- Deobfuscate:
  - Look for eval() functions.
  - Replace eval() with console.log() to print the payload instead of running it.
  - Run the script in a safe, isolated Node.js environment or a browser sandbox (like JSFiddle).
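The eval-to-console.log substitution can be done with a text editor, but a small script makes it repeatable and tells you how many call sites were changed. This is a minimal sketch with a hypothetical helper name, `neutralize_eval`. It only rewrites direct `eval(` calls, so aliases such as `var e = eval` or names built from string fragments will slip through and still need a manual read.

```python
import re

# Direct calls only: the word "eval", optional whitespace, then "(".
_EVAL_CALL = re.compile(r"\beval\s*\(")


def neutralize_eval(js_source):
    """Rewrite eval( to console.log( and return (new_source, replacements)."""
    return _EVAL_CALL.subn("console.log(", js_source)
```

A result with zero replacements on a script that is clearly obfuscated is itself a hint that eval is being reached indirectly.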
Extracting Embedded Files (Droppers)
If you found /EmbeddedFile:
- Locate the object:
pdf-parser.py -s /EmbeddedFile suspicious.pdf
- Extract the payload:
# -d dumps the data to a file
pdf-parser.py -o [object_id] -f -d dropped_malware.bin suspicious.pdf
- Analyze the dropped file: Run file, strings, or upload dropped_malware.bin to VirusTotal (if your TLP allows).
Phase 6: Fraud & Document Manipulation
For Business Email Compromise and fake invoices, the goal is not to find code, but to find edits.
1. Incremental Updates
PDFs can be updated without rewriting the whole file. Each update is appended to the end of the file with its own cross-reference section and trailer.
# Check for multiple versions/updates: more than one StartXref suggests appended revisions
pdf-parser.py -a suspicious.pdf | grep -i "startxref"
2. Text Analysis
Extract text to see if hidden layers exist or if numbers don’t match the image.
pdftotext -layout suspicious.pdf -
3. Visual Anomaly Detection
- Are the bank details in a different font than the rest of the invoice?
- Is the bank logo pixelated while the rest of the text is crisp? (Implies a copy-paste job).
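Returning to the incremental updates from step 1: because each saved revision ends with its own `%%EOF` marker, cutting the file at an earlier marker usually recovers the document as it looked before the edit. The sketch below shows that idea using raw byte search only; `split_revisions` is a hypothetical helper. One caveat to keep in mind: linearized ("fast web view") PDFs contain two `%%EOF` markers without having been edited, so a count of two is not proof of tampering by itself.

```python
def split_revisions(data):
    """Return one byte string per %%EOF marker, each a candidate revision.

    Element 0 is the file truncated at the first %%EOF, and the last
    element runs up to the final %%EOF.
    """
    marker = b"%%EOF"
    revisions = []
    position = 0
    while True:
        index = data.find(marker, position)
        if index == -1:
            break
        end = index + len(marker)
        revisions.append(data[:end])
        position = end
    return revisions
```

Writing each element to its own file and running pdftotext on every one lets you diff the extracted text between revisions, which is often the quickest way to see exactly which bank detail or total was changed.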
Practical Challenges & Solutions
Challenge 1: Multi-layer Encoding
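Attackers may stack several encodings on a stream, for example hex, then Base64, then compression.
Solution:
Dump the raw object and peel the layers one at a time in a tool like CyberChef, or chain command-line tools. Decode in the reverse order of encoding. For hex that was then wrapped in Base64:
base64 -d payload.txt | xxd -r -p > decoded.bin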
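When the order of the layers is not known in advance, a short script can try each decoder in turn until none applies. The following is a heuristic sketch, not a reliable classifier: `peel_layers` and `DECODERS` are illustrative names, "Zip" is assumed to mean zlib/deflate compression rather than a .zip archive, and a short final payload that happens to look like valid Base64 or hex would be decoded one step too far.

```python
import base64
import binascii
import zlib


def _unhex(blob):
    return binascii.unhexlify(blob.strip())


def _unbase64(blob):
    return base64.b64decode(blob.strip(), validate=True)


# Order matters: hex digits are also valid Base64 characters, so hex is
# tried before Base64.
DECODERS = [("zlib", zlib.decompress), ("hex", _unhex), ("base64", _unbase64)]


def peel_layers(blob, max_layers=10):
    """Strip encoding layers; return (payload, names_of_layers_removed)."""
    applied = []
    for _ in range(max_layers):
        for name, decode in DECODERS:
            try:
                decoded = decode(blob)
            except (zlib.error, binascii.Error, ValueError):
                continue  # this decoder does not apply to the current bytes
            if decoded:
                blob = decoded
                applied.append(name)
                break
        else:
            break  # no decoder applied, so assume the payload is reached
    return blob, applied
```

The returned list of layer names doubles as documentation of how the payload was wrapped, which is worth recording alongside the decoded bytes.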
Challenge 2: Objects Hidden in Object Streams
Wait, pdfid showed 0 objects, but the file size is huge?
Solution:
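Modern PDFs use “Object Streams” to compress the PDF structure itself, which hides the contained objects from simple keyword scans. pdf-parser.py can parse them when given the -O option; combine it with the statistics view to see what is inside:
pdf-parser.py -O -a suspicious.pdf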
Challenge 3: Encryption
If the PDF is password protected, you cannot analyze the structure easily.
Solution:
If the password is empty or already known, create a decrypted version using qpdf:
qpdf --password='' --decrypt input.pdf decrypted.pdf
Conclusion
PDF analysis requires a methodical approach. By combining triage (pdfid) with parsing (pdf-parser), analysts can uncover both automated malware and manual fraud.
While static analysis is the safest starting point, always remember that determined attackers use obfuscation that may require dynamic analysis in a secure sandbox (like Any.Run or a dedicated malware lab) to fully understand the payload’s behavior.