How to Analyze Suspicious PDFs

By Bearloggs
#forensics #malware-analysis

The Definitive Guide to PDF Forensics & Malware Analysis

Understanding the PDF Threat Landscape

PDF files have evolved from simple static documents into a common attack vector. The threats fall into three broad categories.

1. Malicious JavaScript Execution

PDFs can contain embedded JavaScript that executes automatically when the document is opened. While JavaScript has legitimate uses such as form validation, attackers leverage it for:

  • Heap Spraying: Manipulating memory to prepare for an exploit.
  • Payload Execution: Downloading binaries from remote servers.
  • Reader Exploitation: Attacking vulnerabilities in Adobe Reader or Foxit.

2. Embedded Malware and Droppers

The PDF specification allows embedding various file formats (.exe, .js, .docx). These “Droppers” rely on social engineering to trick users into saving and opening the attached malicious file, often bypassing email gateway scanners that inspect the PDF but miss the embedded object.

3. Business Email Compromise & Forgery

Not all malicious PDFs contain code. Invoice fraud is a massive industry. Attackers intercept legitimate PDF invoices and modify:

  • Bank account numbers (IBAN/SWIFT).
  • Payment addresses.
  • Totals or line items.

Detecting this requires forensic analysis of visual inconsistencies and modification metadata rather than malware analysis.

Prerequisites & Tooling

To follow this guide, you will need a Linux environment. The specialized tools mentioned below are pre-installed on the REMnux distribution, or can be downloaded from the Didier Stevens Suite.

  • Standard Utils: md5sum, sha256sum, file, strings, binwalk
  • Didier Stevens Suite: pdfid.py, pdf-parser.py
  • Metadata & Text Tools: exiftool, pdfinfo, pdftotext
  • Decryption: qpdf

Step-by-Step Analysis Methodology

Phase 1: Triage and Identification

Before touching the file structure, perform a safety check and identify the file.

1. Calculate Hashes: Always document your evidence.

md5sum suspicious.pdf
sha256sum suspicious.pdf

2. Validate File Magic: Ensure the file is actually a PDF and not an executable renamed to .pdf.

file suspicious.pdf
# Look for: "PDF document, version 1.x"

head -c 20 suspicious.pdf
# Look for header: %PDF-1.x (or %PDF-2.0 for PDF 2.0 files)

3. Safety Copy: Never analyze the original evidence file. Work on a copy and confirm that its hash matches the original.

cp suspicious.pdf working_copy.pdf
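
The triage steps above can be wrapped into one helper so they run the same way on every sample. The sketch below is illustrative only: `triage_pdf` is a hypothetical function name, not part of any tool named in this guide, and it assumes GNU coreutils and grep. It searches the first 1024 bytes for the header because Acrobat-family readers tolerate leading junk before `%PDF-`.

```shell
# triage_pdf FILE
# Hypothetical helper: prints the SHA-256 of FILE and reports whether a
# PDF header is present. Returns non-zero when no header is found.
triage_pdf() {
    local f="$1"
    # Record the hash first so later findings can be tied to this exact file.
    printf 'sha256: %s\n' "$(sha256sum -- "$f" | cut -d' ' -f1)"
    # Look for the magic within the first 1024 bytes, not only at offset 0.
    if head -c 1024 -- "$f" | grep -a -q -e '%PDF-'; then
        echo "magic: PDF header found"
    else
        echo "magic: no PDF header in first 1024 bytes"
        return 1
    fi
}
```

Calling `triage_pdf working_copy.pdf` prints two lines. A non-zero exit status means the file deserves a look with `file` before any PDF tooling is pointed at it.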

Phase 2: Metadata & History

Metadata can reveal the “story” of the document.

Check timestamps and creator tools:

pdfinfo suspicious.pdf
exiftool suspicious.pdf
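
  • Red Flag: A “CreationDate” that is newer than the “ModDate”.
  • Red Flag: A document claiming to be a “Bank of America Invoice” whose Creator or Producer field names an unexpected tool such as “PwnPDF”. A generic library such as “iText” is weaker evidence, since it is widely used for legitimate programmatic generation, but it is worth comparing against known-good invoices from the same sender.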
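
The first red flag can be checked mechanically once the raw date strings are in hand (`pdfinfo -rawdates` prints them in the PDF `D:YYYYMMDDHHmmSS` form). The following is a rough sketch: `created_after_modified` is a hypothetical helper, and it assumes both dates carry a full 14-digit timestamp in the same time zone, so it ignores the UTC offset suffix.

```shell
# created_after_modified CREATION_DATE MOD_DATE
# Hypothetical helper. Arguments are raw PDF date strings,
# e.g. D:20240105093000+01'00'
# Exit status: 0 if CreationDate is later than ModDate, 1 if not,
# 2 if either string lacks a full YYYYMMDDHHmmSS timestamp.
created_after_modified() {
    local c m
    # Keep digits only, then the first 14 (the timestamp without the offset).
    c=$(printf '%s' "$1" | tr -cd '0-9' | cut -c1-14)
    m=$(printf '%s' "$2" | tr -cd '0-9' | cut -c1-14)
    [ "${#c}" -eq 14 ] && [ "${#m}" -eq 14 ] || return 2
    # Assumption: both timestamps are expressed in the same time zone.
    [ "$c" -gt "$m" ]
}
```

A status of 0 only says the two fields are inconsistent. Some generators write odd dates, so treat it as a reason to dig further, not as a verdict.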

Phase 3: Structural Overview (pdfid)

Use pdfid.py to get a high-level view of the PDF’s internal objects without executing any code.

pdfid.py suspicious.pdf

The “Dirty Dozen” Keywords to watch:

| Keyword | Risk | Description |
| --- | --- | --- |
| /JavaScript | High | Indicates script execution capabilities. |
| /JS | High | Abbreviation for JavaScript. |
| /AA / /OpenAction | High | Actions that trigger automatically, either on open (/OpenAction) or on other events such as viewing a page (/AA). |
| /Launch | Critical | Can launch external executables. |
| /EmbeddedFile | Medium | Indicates files explicitly attached to the PDF. |
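
A raw byte search makes a useful cross-check on these keywords. pdfid normalizes hex-escaped names such as /J#61vaScript, while a literal search does not, so a keyword that pdfid reports but a literal search misses suggests the name was obfuscated. The sketch below assumes GNU grep; `count_keyword` is a hypothetical helper, and because it matches substrings it also counts longer names that merely begin with the keyword.

```shell
# count_keyword FILE KEYWORD
# Hypothetical helper: counts literal occurrences of KEYWORD in FILE.
count_keyword() {
    # -a treats binary data as text, -o prints one line per match,
    # -F disables regex so the keyword is matched literally.
    LC_ALL=C grep -a -o -F -e "$2" -- "$1" | wc -l
}
```

For example, `count_keyword suspicious.pdf /JavaScript` can be compared against the /JavaScript count in the pdfid output.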

Phase 4: Deep Dive (pdf-parser)

If pdfid flags suspicious keywords, use pdf-parser.py to examine the specific objects.

1. Search for Specific Keywords

# Locate the object ID for JavaScript elements
pdf-parser.py -s /JavaScript suspicious.pdf

2. Inspect an Object: Once you have an Object ID (e.g., 10), inspect its content:

# View Object 10
pdf-parser.py -o 10 suspicious.pdf

3. Handling Compressed Streams: Malware authors commonly compress malicious streams and scripts with /FlateDecode, which also hides them from strings and grep. Use -f to pass the stream through its filters and decompress it.

# Dump and decompress Object 10 to stdout
pdf-parser.py -o 10 -f suspicious.pdf

Phase 5: Extraction & Deobfuscation

Extracting JavaScript

Attackers often obfuscate JS to hide shellcode.

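  1. Extract the raw JS:
    # -f applies the stream filters; -d writes only the decoded stream to a file
    pdf-parser.py -o [object_id] -f -d malicious.js suspicious.pdf
  2. Deobfuscate:
    • Look for eval() calls.
    • Replace eval() with console.log() to print the payload instead of running it.
    • Run the modified script in an isolated, disposable Node.js environment such as an analysis VM. Avoid online playgrounds like JSFiddle: they execute the code in your own browser and are not an isolation boundary.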
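
The eval() replacement can be scripted so it is applied uniformly before anything runs. This is a minimal sketch, assuming a sed that supports -E; `defang_eval` is a hypothetical name. It only rewrites direct calls spelled `eval(`, so method-style calls (anything written as `.eval(`) and aliased calls through a variable still need manual review.

```shell
# defang_eval: read JavaScript on stdin, write it to stdout with direct
# eval( calls rewritten to console.log( so payloads are printed, not run.
# Hypothetical helper for illustration.
defang_eval() {
    # The leading group keeps identifiers that merely end in "eval" intact.
    # The :a / ta loop repeats the substitution until nothing changes, which
    # catches nested calls such as eval(eval(x)).
    sed -E \
        -e ':a' \
        -e 's/(^|[^A-Za-z0-9_$.])eval[[:space:]]*\(/\1console.log(/g' \
        -e 'ta'
}
```

A typical use is `defang_eval < malicious.js > defanged.js`, followed by a manual read of defanged.js before it is executed anywhere.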

Extracting Embedded Files (Droppers)

If you found /EmbeddedFile:

  1. Locate the object:
    pdf-parser.py -s /EmbeddedFile suspicious.pdf
  2. Extract the payload:
    # -d dumps the data to a file
    pdf-parser.py -o [object_id] -f -d dropped_malware.bin suspicious.pdf
  3. Analyze the dropped file: Run file, strings, or upload dropped_malware.bin to VirusTotal (if your TLP allows).

Phase 6: Fraud & Document Manipulation

For Business Email Compromise and fake invoices, the goal is not to find code, but to find edits.

1. Incremental Updates: PDFs can be updated without rewriting the whole file. Each update is appended to the end of the file with its own cross-reference section.

# More than one XREF, Trailer, or StartXref entry suggests the file was updated after creation
pdf-parser.py -a suspicious.pdf | grep -iE "xref|trailer"

2. Text Analysis: Extract the text to see whether hidden layers exist or whether the numbers differ from what the rendered page shows.

pdftotext -layout suspicious.pdf -

3. Visual Anomaly Detection

  • Are the bank details in a different font than the rest of the invoice?
  • Is the bank logo pixelated while the rest of the text is crisp? (Implies a copy-paste job).
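
For the incremental-update check in step 1, counting `%%EOF` markers and carving out the bytes up to the first one gives an earlier revision to compare against the current text. This is a sketch under stated assumptions: both function names are hypothetical, GNU grep is assumed for byte offsets, and linearized ("fast web view") PDFs normally contain two `%%EOF` markers even without edits, so a count of 2 is a prompt to look closer, not proof.

```shell
# count_revisions FILE
# Hypothetical helper: prints the number of %%EOF markers in FILE.
count_revisions() {
    LC_ALL=C grep -a -o -e '%%EOF' -- "$1" | wc -l
}

# first_revision FILE OUT
# Hypothetical helper: copies FILE up to and including its first %%EOF
# marker into OUT. Returns 1 if FILE contains no marker.
first_revision() {
    local off
    # With -o, grep -b prints the byte offset of each match itself.
    off=$(LC_ALL=C grep -a -b -o -e '%%EOF' -- "$1" | head -n 1 | cut -d: -f1)
    [ -n "$off" ] || return 1
    # The marker is 5 bytes long, so keep offset + 5 bytes.
    head -c "$((off + 5))" -- "$1" > "$2"
}
```

Running `pdftotext -layout` on the carved file and on the full file, then comparing the two outputs, shows which values changed between revisions.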

Practical Challenges & Solutions

Challenge 1: Multi-layer Encoding

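Attackers may stack several encodings on one stream, for example Hex, then Base64, then Zip.

Solution: Dump the raw object and peel the layers in reverse order, outermost first, with a tool like CyberChef or a command-line pipeline. For a payload that was hex-encoded and then Base64-encoded:

# xxd needs -p to read plain hex rather than a formatted hex dump
base64 -d payload.txt | xxd -r -p > decoded.bin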
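
Before trusting a decode pipeline on a real payload, it helps to confirm the layer order on a known string. The sketch below wraps a two-layer decode (Base64 outside, plain hex inside) in a function. `decode_layers` is a hypothetical name, and the layer order is an assumption that must be adapted to what the sample actually uses.

```shell
# decode_layers: read Base64 text on stdin whose decoded content is plain
# hex, and write the recovered bytes to stdout. Hypothetical helper.
decode_layers() {
    # Outer layer first: Base64. Then strip whitespace from the hex text
    # and convert it back to bytes (-p means plain hex, no offsets).
    base64 -d | tr -d ' \r\n' | xxd -r -p
}
```

A round trip such as `printf 'hello' | xxd -p | base64 | decode_layers` should print hello. If it does not, the layers are being undone in the wrong order.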

Challenge 2: Objects Hidden in Object Streams

pdfid shows almost no objects, yet the file is large?

Solution: Since PDF 1.5, “Object Streams” (/ObjStm) can hold other objects in compressed form, which hides them from a plain keyword scan. Recent versions of pdf-parser.py can parse them with the -O option. Start with the statistics view to confirm that /ObjStm objects are present:

pdf-parser.py -a suspicious.pdf

Challenge 3: Encryption

If the PDF is encrypted, its strings and streams cannot be read until it is decrypted.

Solution: If the password is empty or known, create a decrypted copy using qpdf:

qpdf --password='' --decrypt input.pdf decrypted.pdf

Conclusion

PDF analysis requires a methodical approach. By combining triage (pdfid) with parsing (pdf-parser), analysts can uncover both automated malware and manual fraud.

While static analysis is the safest starting point, always remember that determined attackers use obfuscation that may require dynamic analysis in a secure sandbox (like Any.Run or a dedicated malware lab) to fully understand the payload’s behavior.
