Magic Bytes vs MIME Type vs File Extension

File type detection usually involves three signals: the extension, the MIME type, and the magic bytes. They answer different questions and have different trust levels.

Signal | Example | Comes from | Best use | Main weakness
--- | --- | --- | --- | ---
Extension | .pdf | Filename | UX, routing, user expectations | User-controlled and easy to rename
MIME type | application/pdf | HTTP header, OS, database, detector | Content negotiation, storage metadata | Often inferred or supplied by the client
Magic bytes | 25 50 44 46 2D | File content | Server-side detection and validation | Shared containers need deeper inspection

The right approach is not to choose one signal forever. Use each signal for the job it is good at, and decide which one wins when they disagree.

File extensions are labels

An extension is part of the filename. It is useful because people and operating systems understand it, but it is not proof of content.

Renaming report.exe to report.pdf changes the label only. The bytes at the beginning still start with the Windows executable signature 4D 5A, not the PDF signature 25 50 44 46 2D.

Use extensions for:

  • Displaying filenames.
  • Picking a default icon.
  • Selecting which allowlist rule the content is expected to match.
  • Warning users when the name and content disagree.

Do not use extensions alone for upload security or automated processing.
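
The last use in the list above, warning when the name and content disagree, can be sketched in a few lines of Python. This is a minimal illustration rather than a complete detector: the EXPECTED_PREFIX table and the name_matches_content function are hypothetical names, and the table covers only the signatures mentioned in this article.

```python
from pathlib import Path

# Illustrative subset: extension -> bytes expected at offset 0.
EXPECTED_PREFIX = {
    ".pdf": bytes.fromhex("255044462D"),        # "%PDF-"
    ".png": bytes.fromhex("89504E470D0A1A0A"),
    ".exe": bytes.fromhex("4D5A"),              # "MZ"
}

def name_matches_content(filename: str, head: bytes) -> bool:
    """Return True when the header bytes fit the signature implied by the extension."""
    ext = Path(filename).suffix.lower()
    prefix = EXPECTED_PREFIX.get(ext)
    if prefix is None:
        return False  # unknown extension: nothing to compare against
    return head.startswith(prefix)

# A Windows executable renamed to report.pdf still starts with 4D 5A.
renamed_exe = b"MZ\x90\x00" + b"\x00" * 60
print(name_matches_content("report.pdf", renamed_exe))    # False
print(name_matches_content("report.pdf", b"%PDF-1.7\n"))  # True
```

A False result here is a reason to warn or reject, not a full verdict on the file.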

MIME types are metadata

A MIME type is a text label such as image/png or application/pdf. It can come from an HTTP header, a database column, an operating system registry, a library, or a browser sniffing algorithm.

On upload, the most dangerous MIME value is the one in the multipart request. The client controls it:

Bash
curl -F "file=@script.php;type=image/png" https://example.com/upload

That request tells the server the file is image/png, even though the bytes can be PHP source. A browser may also infer a MIME type from extension or limited content sniffing. That can be useful for display, but it is not a validation boundary.
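
On the server, the declared type is just a string from the request. One way to handle it, sketched below under the assumption that the first bytes of the upload are available, is to detect the type from content and use the declared value only as a consistency check. The names sniff_mime and declared_mime_is_consistent are hypothetical, and the signature table is deliberately tiny.

```python
from typing import Optional

# Illustrative subset: MIME label -> bytes expected at offset 0.
SIGNATURES = {
    "image/png": bytes.fromhex("89504E470D0A1A0A"),
    "application/pdf": bytes.fromhex("255044462D"),
}

def sniff_mime(head: bytes) -> Optional[str]:
    """Return the MIME label whose signature matches the content, if any."""
    for mime, magic in SIGNATURES.items():
        if head.startswith(magic):
            return mime
    return None

def declared_mime_is_consistent(declared: str, head: bytes) -> bool:
    """Compare the client-declared type with what the bytes say."""
    detected = sniff_mime(head)
    # Drop parameters such as "; charset=..." before comparing.
    declared_type = declared.split(";")[0].strip().lower()
    return detected is not None and detected == declared_type

php_source = b"<?php echo 'hi'; ?>"
print(declared_mime_is_consistent("image/png", php_source))  # False
```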

Use MIME types for:

  • HTTP Content-Type responses after you have validated or generated the file.
  • Database metadata that records your own detection result.
  • Matching downstream processors that expect MIME labels.

Do not treat client-supplied MIME values as proof.

Magic bytes are content signals

Magic bytes are byte sequences inside the file. They are often at offset 0, but some formats store their identifying bytes later.

Examples:

Format | Magic bytes | Offset | MIME type
--- | --- | --- | ---
PDF | 25 50 44 46 2D | 0 | application/pdf
PNG | 89 50 4E 47 0D 0A 1A 0A | 0 | image/png
ZIP | 50 4B 03 04 | 0 | application/zip
MP4 | 66 74 79 70 | 4 | video/mp4
DICOM | 44 49 43 4D | 128 | application/dicom
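
The table above translates almost directly into a lookup structure. The following Python sketch is one possible encoding; the Signature type, HEADER_SIZE constant, and match_signatures function are names chosen for illustration, and the list holds only the five rows shown.

```python
from typing import List, NamedTuple

class Signature(NamedTuple):
    name: str
    magic: bytes
    offset: int
    mime: str

SIGNATURES = [
    Signature("PDF", bytes.fromhex("255044462D"), 0, "application/pdf"),
    Signature("PNG", bytes.fromhex("89504E470D0A1A0A"), 0, "image/png"),
    Signature("ZIP", bytes.fromhex("504B0304"), 0, "application/zip"),
    Signature("MP4", bytes.fromhex("66747970"), 4, "video/mp4"),
    Signature("DICOM", bytes.fromhex("4449434D"), 128, "application/dicom"),
]

# Number of leading bytes needed to test every signature in the table.
HEADER_SIZE = max(s.offset + len(s.magic) for s in SIGNATURES)

def match_signatures(head: bytes) -> List[Signature]:
    """Return every signature whose magic bytes appear at its documented offset."""
    return [
        s for s in SIGNATURES
        if head[s.offset:s.offset + len(s.magic)] == s.magic
    ]

dicom_head = b"\x00" * 128 + b"DICM"
print(HEADER_SIZE)                                      # 132
print([s.name for s in match_signatures(dicom_head)])   # ['DICOM']
```

Because DICOM keeps its marker at offset 128, reading only the first 8 or 16 bytes would miss it, which is why the header size is derived from the table rather than hard-coded.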

Magic-byte matching is a stronger first check than the extension or MIME type because it reads the content itself. You can try it directly with the lookup tool, or query the API:

Bash
curl "https://filesignature.org/api/v1/identify?hex=89%2050%204E%2047%200D%200A%201A%200A"
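
To build the same query from a file's leading bytes, the hex string has to be space-separated and URL-encoded. The Python sketch below only constructs the URL shown above; it does not send the request and makes no assumption about the response format. The identify_url helper is a hypothetical name.

```python
from urllib.parse import quote

API_URL = "https://filesignature.org/api/v1/identify"  # endpoint as given in the text

def identify_url(head: bytes) -> str:
    """Build the lookup URL for the given header bytes."""
    hex_pairs = head.hex(" ").upper()  # e.g. "89 50 4E 47 ..." (Python 3.8+)
    return f"{API_URL}?hex={quote(hex_pairs)}"

print(identify_url(bytes.fromhex("89504E470D0A1A0A")))
# https://filesignature.org/api/v1/identify?hex=89%2050%204E%2047%200D%200A%201A%200A
```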

When the signals disagree

For upload validation, treat disagreement as a warning or rejection.

Extension | Client MIME | Magic bytes | Decision
--- | --- | --- | ---
.png | image/png | PNG signature | Accept if PNG is allowed and parsing succeeds
.png | image/png | EXE 4D 5A | Reject
.pdf | application/pdf | ZIP 50 4B 03 04 | Reject or route to ZIP-family validation
.docx | DOCX MIME | ZIP 50 4B 03 04 | Inspect ZIP internals before accepting
no extension | missing | PDF signature | Accept only if PDF is in the allowlist

The magic bytes should usually decide the first parser to use. The extension and MIME type can then be checked for consistency.
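
The decision table can be expressed as a small function in which the detected type leads and the other two signals are checked for consistency. This is a sketch under assumptions: the ALLOWED and CONTAINERS tables are a hypothetical policy for a single endpoint, detected is a label produced by an earlier magic-byte step, and "accept" still means the file must go on to parse successfully.

```python
from typing import Optional

# Hypothetical policy: detected type -> (consistent extensions, consistent MIME labels).
ALLOWED = {
    "PNG": ({".png"}, {"image/png"}),
    "PDF": ({".pdf"}, {"application/pdf"}),
}
# Container signatures that need a look inside before any decision.
CONTAINERS = {"ZIP"}

def decide(extension: Optional[str], client_mime: Optional[str],
           detected: Optional[str]) -> str:
    """Return 'accept', 'reject', or 'inspect' for one upload."""
    if detected in CONTAINERS:
        return "inspect"   # route to ZIP-family validation
    if detected not in ALLOWED:
        return "reject"    # unknown or disallowed content
    extensions, mimes = ALLOWED[detected]
    if extension is not None and extension.lower() not in extensions:
        return "reject"    # name disagrees with content
    if client_mime is not None and client_mime.lower() not in mimes:
        return "reject"    # declared type disagrees with content
    return "accept"

print(decide(".png", "image/png", "PNG"))        # accept
print(decide(".png", "image/png", "EXE"))        # reject
print(decide(".pdf", "application/pdf", "ZIP"))  # inspect
print(decide(None, None, "PDF"))                 # accept
```

Missing metadata is tolerated here, but metadata that contradicts the content is not. A stricter endpoint could reject instead of inspecting when a .pdf turns out to be a ZIP.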

Shared signatures require second-level checks

Magic bytes can identify the container but not always the exact format. The ZIP signature is the classic example. DOCX, XLSX, PPTX, APK, JAR, EPUB, and many other formats all use ZIP containers.

For these files, do two checks:

  1. Confirm the container signature.
  2. Inspect required internal files or fields.

Examples:

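  • DOCX requires Open Packaging Conventions metadata (a [Content_Types].xml entry) and a word/ directory.
  • XLSX requires the same packaging metadata and an xl/ directory.
  • EPUB requires a mimetype file containing application/epub+zip.
  • WEBP is not ZIP-based, but it follows the same pattern: it requires a RIFF header at offset 0 plus WEBP as the form type at offset 8.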
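
The two-step check can be sketched in Python with the standard zipfile module. The function names classify_zip and is_webp are hypothetical, and the sketch reads "Open Packaging Convention metadata" as the presence of a [Content_Types].xml entry, which is an assumption about how strict the check should be. A production version would also cap entry counts and decompressed sizes.

```python
import io
import zipfile
from typing import Optional

def classify_zip(data: bytes) -> Optional[str]:
    """Step 1: confirm the ZIP signature. Step 2: look at required entries."""
    if not data.startswith(b"PK\x03\x04"):
        return None
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            names = zf.namelist()
            if "mimetype" in names:
                if zf.read("mimetype").strip() == b"application/epub+zip":
                    return "EPUB"
            if "[Content_Types].xml" in names:
                if any(n.startswith("word/") for n in names):
                    return "DOCX"
                if any(n.startswith("xl/") for n in names):
                    return "XLSX"
    except zipfile.BadZipFile:
        return None  # signature matched, but the archive does not parse
    return "ZIP"     # a valid ZIP with no recognized inner layout

def is_webp(head: bytes) -> bool:
    """RIFF container at offset 0 with the WEBP form type at offset 8."""
    return head[0:4] == b"RIFF" and head[8:12] == b"WEBP"

print(is_webp(b"RIFF\x24\x00\x00\x00WEBPVP8 "))  # True
print(is_webp(b"RIFF\x24\x00\x00\x00WAVEfmt "))  # False
```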

This is why a file identification API can return multiple matches for one header. It is reporting all candidates that match the bytes you supplied.

Recommended validation order

For server-side validation:

  1. Normalize the filename and extension.
  2. Read enough header bytes for every format in your allowlist.
  3. Compare magic bytes at the documented offsets.
  4. For containers, inspect the internal structure.
  5. Confirm the detected type is allowed for that endpoint.
  6. Store your own detected type and MIME label.
  7. Serve the file with a safe Content-Type only after validation.
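
Steps 1 to 6 can be tied together in one function. The Python sketch below is illustrative only: validate_upload, detect, and refine_zip are hypothetical names, the signature list covers three formats, and the whole upload is passed in as bytes for brevity where a real service would read a bounded header first.

```python
import io
import os
import zipfile
from typing import NamedTuple, Optional, Set

class Detected(NamedTuple):
    kind: str
    mime: str

# (kind, magic bytes, offset, MIME label)
SIGNATURES = [
    ("PDF", b"%PDF-", 0, "application/pdf"),
    ("PNG", b"\x89PNG\r\n\x1a\n", 0, "image/png"),
    ("ZIP", b"PK\x03\x04", 0, "application/zip"),
]
DOCX_MIME = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"

def refine_zip(data: bytes) -> Optional[Detected]:
    """Step 4: look inside the container to narrow down the format."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            names = zf.namelist()
    except zipfile.BadZipFile:
        return None
    if "[Content_Types].xml" in names and any(n.startswith("word/") for n in names):
        return Detected("DOCX", DOCX_MIME)
    return Detected("ZIP", "application/zip")

def detect(data: bytes) -> Optional[Detected]:
    """Steps 2 and 3: compare magic bytes at their documented offsets."""
    for kind, magic, offset, mime in SIGNATURES:
        if data[offset:offset + len(magic)] == magic:
            return refine_zip(data) if kind == "ZIP" else Detected(kind, mime)
    return None

def validate_upload(filename: str, data: bytes, allowed: Set[str]) -> Optional[dict]:
    # Step 1: normalize the filename and extension (drop any directory part).
    safe_name = os.path.basename(filename.replace("\\", "/")).strip()
    extension = os.path.splitext(safe_name)[1].lower()
    detected = detect(data)
    # Step 5: the detected type must be allowed for this endpoint.
    if detected is None or detected.kind not in allowed:
        return None
    # Step 6: store the server's own detection result, not the client's claim.
    return {"name": safe_name, "extension": extension,
            "kind": detected.kind, "mime": detected.mime}

print(validate_upload("../../report.pdf", b"%PDF-1.7\n", {"PDF"}))
# {'name': 'report.pdf', 'extension': '.pdf', 'kind': 'PDF', 'mime': 'application/pdf'}
```

Step 7 then uses the stored mime value, never the client-supplied one, when setting Content-Type on the response.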

The extension is for humans. The MIME type is for metadata and HTTP. The magic bytes are the content-level starting point for detection.

Frequently Asked Questions

Which is most reliable: magic bytes, MIME type, or extension?

Magic bytes are usually the strongest first signal because they are part of the file content. Extensions and MIME types are useful metadata, but they can be missing, stale, or user-controlled.

Can a MIME type be trusted on upload?

No. Multipart upload Content-Type values are supplied by the client. Use them as hints for UX or routing, but validate the content bytes on the server.

Are magic bytes enough for security?

No. They are a necessary first step for type detection, but they are not sufficient on their own. Shared containers, polyglot files, and malformed files require additional parsing, allowlists, and safe storage controls.

Why do databases list multiple MIME types for one extension?

Some formats have legacy, vendor, and standardized MIME labels. The bytes can identify the format while the MIME label varies by application or registry.