Validate File Types by Magic Bytes
To validate a file's type by its magic bytes, read the first bytes of its contents and compare them against the known signature for each format you accept. This is reliable because the signature is written into the file by the program that created it; the filename extension and the upload Content-Type are labels supplied by whoever sends the file and can be changed in seconds.
What a magic byte (file signature) actually is
A magic byte sequence (also called a magic number or file signature) is a fixed sequence of bytes, usually near the start of a file, that identifies its format independent of the filename. Most signatures are short, typically 2 to 15 bytes.
You can verify these by opening any matching file in a hex editor:
| Format | Hex signature | ASCII |
|---|---|---|
| PNG | 89 50 4E 47 0D 0A 1A 0A | ‰PNG␍␊␚␊ |
| PDF | 25 50 44 46 | %PDF |
| GIF | 47 49 46 38 | GIF8 |
| ZIP | 50 4B 03 04 | PK␃␄ |
A file declares its type in three places, and they are not equally trustworthy:
- The filename extension (.pdf): text in the name, controlled entirely by the user.
- The Content-Type / MIME header: on an HTTP upload, set by the client.
- The signature in the bytes: written by the application that produced the file, and the only one of the three that survives copying, renaming, and re-uploading.
Signatures are matched byte-for-byte, not as text. 47 49 46 38 and 67 69 66 38 are different signatures even though one is the uppercase and one the lowercase of "gif8". A single differing byte is a different signature. Every format documented here lists its exact hex, offset, MIME type, and risk level on its detail page.
Why the file extension and Content-Type cannot be trusted
The extension lives in the filename, not in the file. Renaming malware.exe to photo.png changes zero bytes of content; the 4D 5A (MZ) header is still there. The double-extension trick, invoice.pdf.exe (which many systems display as invoice.pdf), relies entirely on people reading the name instead of the bytes.
The Content-Type on a multipart upload is no better. The client chooses it, and forging it is a one-line change to a curl request:
    curl -F "file=@shell.php;type=image/png" https://example.com/upload
The server receives Content-Type: image/png for a PHP script. Treat the multipart Content-Type as a hint for naming or routing, never as proof of type. Validate by signature first; use the extension and MIME only as secondary, advisory signals.
How file-type detection works
The matching algorithm is straightforward:
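- Read the first N bytes of the file, where N is the largest offset plus signature length in your table (16–32 bytes covers most cases; a format whose signature sits deeper in the file, such as DICOM at offset 128, needs more).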
- For each format in your table, compare the relevant byte slice against its signature at the correct offset.
- Return the first match, or "unknown".
Read from a buffer or stream rather than loading the whole file. You usually need only the header, and occasionally the trailer; pulling a multi-gigabyte upload into memory to inspect 8 bytes is wasteful and a denial-of-service vector.
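As a rough sketch of reading only the header from a stream, the helper below pulls at most `n` bytes from any binary file-like object and leaves the rest unread. The name `read_head` and the simulated upload are assumptions made for illustration; they are not part of any library.

```python
import io
from typing import BinaryIO


def read_head(stream: BinaryIO, n: int = 32) -> bytes:
    """Read up to n bytes from the current position; the remainder stays unread."""
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:  # EOF: the stream is shorter than n bytes
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


# Usage: a simulated 1 MiB upload; only the first 8 bytes are pulled into memory.
upload = io.BytesIO(b"\x89PNG\r\n\x1a\n" + b"\x00" * (1024 * 1024))
head = read_head(upload, 8)
```

The loop is there because some streams return fewer bytes than requested on a single `read` call; an in-memory buffer does not, but a network-backed stream may.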
Offsets matter, because signatures are not always at byte 0. MP4 and HEIC carry ftyp (66 74 79 70) at offset 4, after a 4-byte box-size field. DICOM places DICM at offset 128, after a 128-byte preamble. A check that only looks at byte 0 will miss every one of these. Hundreds of documented signatures here sit at a nonzero offset.
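The offsets described above can be sketched as a small table in which every entry carries its own offset. The names `OFFSET_SIGNATURES`, `HEAD_SIZE`, and `detect_at_offset` are hypothetical, and the labels are illustrative only.

```python
from typing import Optional

# label -> (offset, signature); offsets taken from the paragraph above.
OFFSET_SIGNATURES = {
    "mp4-or-heic": (4, b"ftyp"),  # after the 4-byte box-size field
    "dicom": (128, b"DICM"),      # after the 128-byte preamble
}

# Read enough bytes to reach the deepest signature: max(offset + length).
HEAD_SIZE = max(off + len(sig) for off, sig in OFFSET_SIGNATURES.values())


def detect_at_offset(head: bytes) -> Optional[str]:
    """Compare each signature at its own offset instead of assuming byte 0."""
    for label, (offset, sig) in OFFSET_SIGNATURES.items():
        if head[offset:offset + len(sig)] == sig:
            return label
    return None


# Usage: a fabricated ftyp box header and a fabricated DICOM preamble.
mp4_head = b"\x00\x00\x00\x18ftypisom"
dicom_head = b"\x00" * 128 + b"DICM"
```

Computing the read size from the table, rather than hard-coding it, keeps the header read large enough when a deeper signature is added later.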
Byte order matters. TIFF begins with either 4D 4D 00 2A (MM, big-endian) or 49 49 2A 00 (II, little-endian); both are valid TIFF. Text files may carry a byte-order mark: UTF-8 EF BB BF, UTF-16 LE FF FE, UTF-16 BE FE FF. Your table must hold both variants of any format that has them.
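One way to hold several variants per format is to map each label to a list of signatures, as in this minimal sketch. `VARIANTS` and `detect_variant` are hypothetical names; the byte values are the ones listed above.

```python
from typing import Optional

# label -> list of accepted (offset, signature) variants
VARIANTS = {
    "tiff": [(0, b"MM\x00\x2a"), (0, b"II\x2a\x00")],  # big- and little-endian
    "utf-8-bom": [(0, b"\xef\xbb\xbf")],
    "utf-16-le-bom": [(0, b"\xff\xfe")],
    "utf-16-be-bom": [(0, b"\xfe\xff")],
}


def detect_variant(head: bytes) -> Optional[str]:
    """Return the first label for which any variant matches at its offset."""
    for label, variants in VARIANTS.items():
        for offset, sig in variants:
            if head[offset:offset + len(sig)] == sig:
                return label
    return None
```

For example, `detect_variant(b"II\x2a\x00...")` and `detect_variant(b"MM\x00\x2a...")` both return `"tiff"`, so callers never need to know which byte order the writer used.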
Some formats are identified by a trailer rather than a header: the identifying bytes sit at the end of the file. When the format requires it, read the tail as well as the head.
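A tail read can be sketched as follows. The example chosen here is an assumption, since the text does not name a specific trailer format: it looks for the ZIP end-of-central-directory record (`50 4B 05 06`), which sits within the last 22 + 65,535 bytes of an archive. `read_tail` and `has_zip_trailer` are hypothetical helper names.

```python
import io
import os
import zipfile
from typing import BinaryIO

EOCD_SIG = b"PK\x05\x06"  # ZIP end-of-central-directory signature
EOCD_MAX = 22 + 0xFFFF    # fixed 22-byte record plus the maximum comment length


def read_tail(stream: BinaryIO, n: int) -> bytes:
    """Return the last n bytes of a seekable stream (all of it if shorter)."""
    size = stream.seek(0, os.SEEK_END)
    stream.seek(max(0, size - n))
    return stream.read()


def has_zip_trailer(stream: BinaryIO) -> bool:
    """True if the ZIP end-of-central-directory signature appears in the tail."""
    return EOCD_SIG in read_tail(stream, EOCD_MAX)


# Usage: build a small archive in memory and check its tail.
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("a.txt", "hi")
```

This only shows where to look; a full check would also parse the record it finds, since the four signature bytes could appear in other data by coincidence.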
A minimal comparison of a buffer against a signature at a given offset:
    def matches(buf: bytes, sig: bytes, offset: int = 0) -> bool:
        return buf[offset:offset + len(sig)] == sig
Validating an upload against an allowlist
Allowlist the formats you accept; do not blocklist the ones you fear. You cannot enumerate every dangerous format, but you can enumerate the handful you actually support. The examples below validate the same four formats, PNG, JPG, PDF, and ZIP, so you can compare across languages.
Python
    # label -> (offset, signature bytes)
    SIGNATURES = {
        "png": (0, bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])),
        "jpg": (0, bytes([0xFF, 0xD8, 0xFF])),
        "pdf": (0, b"%PDF"),
        "zip": (0, bytes([0x50, 0x4B, 0x03, 0x04])),
    }

    def detect(path: str) -> str | None:
        # Read only the header; 32 bytes covers every signature in the table.
        with open(path, "rb") as f:
            head = f.read(32)
        for label, (offset, sig) in SIGNATURES.items():
            if head[offset:offset + len(sig)] == sig:
                return label
        return None
The standard-library imghdr was removed in Python 3.13, so do not reach for it. Either use explicit byte checks like the above, or a maintained library such as filetype (pure-Python, no native dependency) or python-magic (binds to libmagic, broader coverage but needs the system library). Libraries return a best-effort guess, not a guarantee.
Node.js
    import { open } from "node:fs/promises";

    // [label, offset, signature bytes]
    const SIGNATURES = [
      ["png", 0, Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])],
      ["jpg", 0, Buffer.from([0xff, 0xd8, 0xff])],
      ["pdf", 0, Buffer.from("%PDF")],
      ["zip", 0, Buffer.from([0x50, 0x4b, 0x03, 0x04])],
    ];

    export async function detect(path) {
      const fh = await open(path, "r");
      try {
        // Read only the first 32 bytes, starting at file position 0.
        const { buffer } = await fh.read(Buffer.alloc(32), 0, 32, 0);
        for (const [label, offset, sig] of SIGNATURES) {
          if (buffer.subarray(offset, offset + sig.length).equals(sig)) return label;
        }
        return null;
      } finally {
        await fh.close(); // release the handle even if the read throws
      }
    }
The file-type package and magic-bytes.js cover many formats out of the box. Both return a hint, not a verdict; treat the result as an input to your allowlist check, not the check itself.
Go
    import "bytes"

    type sig struct {
        label  string
        offset int
        bytes  []byte
    }

    var signatures = []sig{
        {"png", 0, []byte{0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A}},
        {"jpg", 0, []byte{0xFF, 0xD8, 0xFF}},
        {"pdf", 0, []byte("%PDF")},
        {"zip", 0, []byte{0x50, 0x4B, 0x03, 0x04}},
    }

    func detect(head []byte) string {
        for _, s := range signatures {
            end := s.offset + len(s.bytes)
            if len(head) >= end && bytes.Equal(head[s.offset:end], s.bytes) {
                return s.label
            }
        }
        return ""
    }
The standard library's http.DetectContentType (package net/http) considers at most the first 512 bytes of data and matches them against a small built-in set. It is convenient, but it covers a limited range of formats and returns MIME strings, not exact format names; for an allowlist beyond the common web types, maintain your own table.
Run all of this on the server. Client-side JavaScript validation improves the user experience but provides no security: an attacker posts directly to your endpoint and skips the page entirely.
Why a matching signature is necessary but not sufficient
Here is the caveat most tutorials omit. Many distinct formats share one signature because they are the same underlying container.
- 50 4B 03 04 (PK) is the ZIP signature, shared by 35 documented formats including DOCX, XLSX, PPTX, APK, JAR, EPUB, IPA, ODT, and KMZ. They are all ZIP archives.
- D0 CF 11 E0 A1 B1 1A E1 is the OLE/CFBF compound-file signature, shared by 41 legacy formats including DOC, XLS, PPT, MSI, and MSG.
- 4D 5A (MZ) marks 25 Windows executable formats including EXE, DLL, SYS, SCR, CPL, OCX, and COM.
- 52 49 46 46 (RIFF) is shared by WAV, AVI, and WEBP.
Matching the header tells you the container, not the format. A .docx, a .jar, and a plain .zip are byte-for-byte identical in their first four bytes. To confirm a file is really a DOCX, you must open the ZIP and inspect its central directory for [Content_Types].xml and a word/ directory. To distinguish a DOC from an MSI, you read the OLE storage streams. For RIFF, read the 4-byte form type at offset 8: "WEBP", "AVI " (with a trailing space), or "WAVE". This second-level inspection is the difference between a check that looks correct and one that is correct.
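The second-level checks for ZIP and RIFF might look like the sketch below. It uses only the Python standard library; `looks_like_docx` and `riff_form_type` are hypothetical names, and the DOCX test checks just the two markers mentioned above rather than validating the document fully.

```python
import io
import zipfile
from typing import Optional


def looks_like_docx(data: bytes) -> bool:
    """True if data is a ZIP listing [Content_Types].xml and a word/ entry."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            names = zf.namelist()  # reads the central directory only
    except zipfile.BadZipFile:
        return False
    has_types = "[Content_Types].xml" in names
    has_word = any(name.startswith("word/") for name in names)
    return has_types and has_word


def riff_form_type(head: bytes) -> Optional[str]:
    """Return the 4-byte form type at offset 8 of a RIFF header, else None."""
    if len(head) >= 12 and head[0:4] == b"RIFF":
        return head[8:12].decode("ascii", errors="replace")
    return None


# Usage: a fabricated DOCX-like archive and a fabricated WAV header.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("[Content_Types].xml", "<Types/>")
    zf.writestr("word/document.xml", "<document/>")
docx_like = buf.getvalue()
wav_head = b"RIFF" + b"\x24\x00\x00\x00" + b"WAVEfmt "
```

Listing names does not decompress any entry, which keeps the check cheap; anything that extracts content should still apply its own size limits.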
Magic-byte spoofing, polyglots, and defense-in-depth
Attackers construct polyglot files that satisfy two formats at once. One example is a valid GIF or PNG header followed by a complete ZIP archive or script. The file passes a naive header check and renders as an image, while a different parser downstream treats it as an archive or executable. Appended data is invisible to a check that only reads the first 32 bytes.
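To illustrate why the tail matters, here is a minimal sketch that walks a PNG's chunk list (4-byte big-endian length, 4-byte type, data, 4-byte CRC) and counts any bytes left after the IEND chunk. The function name `png_trailing_bytes` is hypothetical, the sketch does not verify CRCs, and it takes the whole file in memory for brevity.

```python
import struct
import zlib
from typing import Optional

PNG_SIG = b"\x89PNG\r\n\x1a\n"


def png_trailing_bytes(data: bytes) -> Optional[int]:
    """Return the number of bytes after IEND, or None if the chunk walk fails."""
    if not data.startswith(PNG_SIG):
        return None
    pos = len(PNG_SIG)
    while pos + 8 <= len(data):
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        end = pos + 8 + length + 4  # chunk header + data + CRC
        if end > len(data):
            return None             # truncated chunk
        if ctype == b"IEND":
            return len(data) - end  # anything left here is appended data
        pos = end
    return None                     # ran out of data before IEND


def chunk(ctype: bytes, body: bytes) -> bytes:
    """Build one PNG chunk (used only to fabricate test input)."""
    crc = zlib.crc32(ctype + body)
    return struct.pack(">I", len(body)) + ctype + body + struct.pack(">I", crc)


# Usage: a structurally minimal PNG, then the same bytes with a ZIP header appended.
clean = PNG_SIG + chunk(b"IHDR", b"\x00" * 13) + chunk(b"IEND", b"")
polyglot = clean + b"PK\x03\x04payload"
```

A nonzero result is a signal worth rejecting or logging, not proof of an attack; some tools append harmless metadata after IEND.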
Be strictest with the formats flagged High risk here: executables and scripts such as EXE, DLL, JAR, JS, MSI, SCR, VBE, and WSF. These are the formats where a successful spoof leads directly to code execution.
Signature validation is required, but it is one layer. The OWASP File Upload Cheat Sheet is explicit that extension and Content-Type checks are insufficient, and that content validation must be combined with other controls:
- Serve uploads from a separate origin or sandbox; never include or execute an uploaded file.
- Assign randomized server-side filenames; do not trust the client name on disk.
- Enforce size limits before reading.
- Re-encode images through a trusted library, which discards appended payloads.
- For container documents, decompose and inspect the internal structure.
The honest goal is correct identification plus containment, not a single perfect gate.
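Two of the controls listed above, randomized server-side filenames and a size cap, are simple enough to sketch with the standard library. The names `storage_name` and `read_limited` and the 5 MiB limit are assumptions for illustration, not recommendations from the OWASP cheat sheet.

```python
import io
import secrets
from typing import BinaryIO

MAX_UPLOAD_BYTES = 5 * 1024 * 1024  # assumed limit; tune per application


def storage_name(label: str) -> str:
    """Random server-side name; the extension comes from the detected label,
    never from the client-supplied filename."""
    return f"{secrets.token_hex(16)}.{label}"


def read_limited(stream: BinaryIO, limit: int = MAX_UPLOAD_BYTES) -> bytes:
    """Read at most limit bytes; raise if the stream holds more.

    Assumes read() returns the requested amount when available, as buffered
    file objects do; a raw socket may need a loop instead.
    """
    data = stream.read(limit + 1)  # one extra byte reveals an oversized upload
    if len(data) > limit:
        raise ValueError("upload exceeds size limit")
    return data


# Usage: a 3-byte upload passes a 3-byte limit; the stored name ignores client input.
payload = read_limited(io.BytesIO(b"abc"), limit=3)
name = storage_name("png")
```

Where the framework or reverse proxy can reject oversized requests before the handler runs, that is the better place for the limit; the in-handler cap is a second line of defense.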
Quick reference
| Format | Hex | ASCII | Offset | MIME |
|---|---|---|---|---|
| PDF | 25 50 44 46 | %PDF | 0 | application/pdf |
| PNG | 89 50 4E 47 0D 0A 1A 0A | ‰PNG… | 0 | image/png |
| JPG | FF D8 FF | n/a | 0 | image/jpeg |
| GIF | 47 49 46 38 | GIF8 | 0 | image/gif |
| ZIP | 50 4B 03 04 | PK␃␄ | 0 | application/zip |
| MP4 | 66 74 79 70 | ftyp | 4 | video/mp4 |
| WEBP | 52 49 46 46 | RIFF | 0 | image/webp |
| ELF | 7F 45 4C 46 | ␡ELF | 0 | application/x-elf |
| EXE | 4D 5A | MZ | 0 | application/vnd.microsoft.portable-executable |
To look up any of the documented formats: browse and search the full index, open a per-format page for its signature and validation code, or query the JSON API at /api/v1/{ext} for programmatic lookups. Each format also has a Markdown representation at /{ext}.md for use as LLM or agent context.
Validate by signature, verify the internal structure of shared-signature containers, and treat magic bytes as one layer of a layered upload defense.
Frequently Asked Questions
What is the difference between a file signature and a MIME type?
A file signature is a sequence of bytes inside the file, written by the program that created it. A MIME type (such as image/png) is a text label that travels alongside the file, in an HTTP header or a database column. The signature is part of the content and survives renaming; the MIME label is metadata and can be wrong, missing, or deliberately spoofed.
Can two different file types have the same magic bytes?
Yes. Container-based formats share a signature. DOCX, XLSX, APK, JAR, and EPUB all begin with the ZIP signature 50 4B 03 04 because they are all ZIP archives. To tell them apart you must inspect the contents; for DOCX, look inside the archive for [Content_Types].xml and the word/ directory.
Is checking magic bytes enough to make file uploads safe?
No. It is necessary but not sufficient. A polyglot file can carry a valid image header with an appended executable payload. Combine signature validation with an allowlist, sandboxed storage, randomized filenames, size limits, image re-encoding, and a strict rule never to execute or include uploaded files.
Why does my file have a signature at offset 4 instead of 0?
Some formats place a length or box header before the type identifier. MP4 and HEIC store ftyp at offset 4, after a 4-byte box-size field; DICOM stores DICM at offset 128, after a 128-byte preamble. Check the documented offset for each format rather than assuming the signature begins at byte 0.
Why was imghdr removed and what should I use instead?
Python's imghdr module was removed in Python 3.13. Use explicit byte-slice comparisons against the signatures you accept, or a maintained library such as filetype (pure Python) or python-magic (libmagic bindings, broader format coverage).