How to Identify a File Without an Extension

When a file has no extension, a misleading extension, or a name like download with no suffix at all, identify it from the bytes. Most binary formats begin with a recognizable byte sequence called a magic number or file signature. That signature survives renaming, copying, and re-uploading, so it is a better starting point than the filename.

The short version:

  1. Read the first bytes of the file.
  2. Convert those bytes to hex.
  3. Compare the hex with each candidate format's signature at its documented offset.
  4. If the signature is shared, inspect the container internals before making a final decision.

You can do the manual lookup with the magic byte lookup tool, or call /api/v1/identify when you already have header bytes in a script.

Read the header bytes

On macOS or Linux, xxd is usually the fastest way to inspect the start of a file:

Bash
xxd -l 32 unknown-file

The left column is the offset, the middle is hex, and the right column is a text preview. A PDF often starts like this:

text
00000000: 2550 4446 2d31 2e37 0a25 e2e3 cfd3 0a31  %PDF-1.7.%.....1

The first bytes are:

text
25 50 44 46 2D

Paste that into the lookup page and it resolves to PDF. The same value can be queried from code:

Bash
curl "https://filesignature.org/api/v1/identify?hex=25%2050%2044%2046%202D"
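
When the starting point is a file rather than a hand-copied hex string, the same query can be assembled in Python. This is a minimal sketch: the helper names `header_hex` and `identify_url` are hypothetical, the endpoint and the `hex` parameter are taken from the curl example above, and nothing is assumed about the response format, so the code only builds the URL.

```python
from urllib.parse import quote

# Endpoint copied from the curl example; the response shape is not assumed here.
API = "https://filesignature.org/api/v1/identify"

def header_hex(data: bytes, count: int = 5) -> str:
    """Format the first `count` bytes as space-separated uppercase hex."""
    return data[:count].hex(" ").upper()

def identify_url(hex_string: str) -> str:
    """Percent-encode the hex string into the `hex` query parameter."""
    return f"{API}?hex={quote(hex_string)}"

if __name__ == "__main__":
    sample = b"%PDF-1.7\n"  # stand-in for bytes read from a real file
    print(header_hex(sample))
    print(identify_url(header_hex(sample)))
```

For a real file, the `sample` bytes would come from `open(path, "rb").read(32)` or a similar bounded read.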

Compare bytes at the right offset

Many signatures start at offset 0, but not all of them. Offset means the number of bytes from the beginning of the file to the first byte of the signature.

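| Format | Signature (hex) | Offset (bytes) | Note |
| --- | --- | --- | --- |
| PNG | 89 50 4E 47 0D 0A 1A 0A | 0 | Full 8-byte PNG header |
| PDF | 25 50 44 46 2D | 0 | The ASCII text %PDF- |
| MP4 | 66 74 79 70 | 4 | The ftyp box type starts after a 4-byte size field |
| DICOM | 44 49 43 4D | 128 | DICM follows a 128-byte preamble |
| TAR | 75 73 74 61 72 00 | 257 | POSIX ustar magic in the header block; older GNU tar archives write 75 73 74 61 72 20 20 00 here |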

Do not scan the entire file for a signature unless the format specifically defines that behavior. A byte sequence found somewhere in the body can be ordinary data, not a file type marker.
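
The difference between an offset check and a whole-file search can be sketched with synthetic buffers. The helper name `matches_at` is hypothetical, and the two byte strings are made up for illustration rather than taken from real files; the DICOM signature and offset are the ones listed above.

```python
def matches_at(data: bytes, offset: int, signature: bytes) -> bool:
    """True only if `signature` sits exactly at `offset` in `data`."""
    return data[offset:offset + len(signature)] == signature

DICM = bytes.fromhex("44 49 43 4D")

# Synthetic DICOM-like header: a 128-byte preamble, then DICM.
dicom_like = bytes(128) + DICM + bytes(16)

# Synthetic buffer where the same four bytes happen to appear in the body.
not_dicom = bytes(300) + DICM + bytes(16)

if __name__ == "__main__":
    print(matches_at(dicom_like, 128, DICM))  # signature at the documented offset
    print(matches_at(not_dicom, 128, DICM))   # bytes exist, but not at offset 128
    print(DICM in not_dicom)                  # a whole-file search would still hit
```

The last line is the false positive the paragraph above warns about: the bytes are present, but not where the format says they must be.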

Use a small script when you need repeatability

For triage, read a fixed header window and compare slices:

Python
SIGNATURES = {
    "png": (0, bytes.fromhex("89 50 4E 47 0D 0A 1A 0A")),
    "pdf": (0, bytes.fromhex("25 50 44 46 2D")),
    "mp4": (4, bytes.fromhex("66 74 79 70")),
}

def identify(path: str) -> list[str]:
    # Read only the header window instead of loading the whole file.
    with open(path, "rb") as f:
        head = f.read(512)
    matches = []
    for label, (offset, sig) in SIGNATURES.items():
        # Compare the slice at the documented offset; never search the body.
        if head[offset:offset + len(sig)] == sig:
            matches.append(label)
    return matches

The 512-byte read covers most common header checks, including every offset listed earlier: MP4 at offset 4, DICOM at 128, and TAR at 257. It does not cover every possible offset in every format, and it does not inspect trailer-only formats.
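
One way to check that a header window is large enough is to compute it from the signature table itself. The sketch below is illustrative only: `required_window` and `TABLE` are hypothetical names, and the entries are the five formats from the offset table in this article.

```python
# (offset, signature) pairs copied from the offset table in this article.
TABLE = {
    "png": (0, bytes.fromhex("89 50 4E 47 0D 0A 1A 0A")),
    "pdf": (0, bytes.fromhex("25 50 44 46 2D")),
    "mp4": (4, bytes.fromhex("66 74 79 70")),
    "dicom": (128, bytes.fromhex("44 49 43 4D")),
    "tar": (257, bytes.fromhex("75 73 74 61 72 00")),
}

def required_window(signatures: dict[str, tuple[int, bytes]]) -> int:
    """Smallest header length that contains every signature in full."""
    return max(offset + len(sig) for offset, sig in signatures.values())

if __name__ == "__main__":
    # TAR dominates: offset 257 plus a 6-byte signature.
    print(required_window(TABLE))
```

For these five entries the result is 263 bytes (257 + 6 for TAR), so a 512-byte read leaves headroom. Adding a format with a later offset would raise the requirement.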

Resolve shared signatures

The most common trap is treating a shared container signature as a final answer.

50 4B 03 04 means "this looks like a ZIP container." It does not prove the exact user-facing format. The same bytes can identify ZIP, DOCX, XLSX, PPTX, APK, JAR, EPUB, ODT, and many other formats. To separate them:

  • DOCX: open the ZIP and look for [Content_Types].xml plus a word/ directory.
  • XLSX: look for [Content_Types].xml plus an xl/ directory.
  • PPTX: look for [Content_Types].xml plus a ppt/ directory.
  • EPUB: look for a mimetype file containing application/epub+zip.
  • APK or JAR: inspect the manifest and expected archive paths.
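
The checks in the list above can be sketched with Python's standard `zipfile` module. This is a rough triage helper, not a validator: `classify_zip` and `make_zip` are hypothetical names, and the APK and JAR member paths (`AndroidManifest.xml`, `META-INF/MANIFEST.MF`) are assumptions based on the usual layout of those archives.

```python
import io
import zipfile

OOXML_DIRS = (("word/", "docx"), ("xl/", "xlsx"), ("ppt/", "pptx"))

def classify_zip(source) -> str:
    """Guess the user-facing format of a ZIP container from its member names.

    `source` is a path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as zf:
        names = set(zf.namelist())
        # EPUB: a small `mimetype` member with a fixed value.
        if "mimetype" in names and zf.getinfo("mimetype").file_size <= 64:
            if zf.read("mimetype").strip() == b"application/epub+zip":
                return "epub"
        # Office Open XML: [Content_Types].xml plus a format-specific directory.
        if "[Content_Types].xml" in names:
            for prefix, label in OOXML_DIRS:
                if any(name.startswith(prefix) for name in names):
                    return label
        # Check APK before JAR, because an APK may also carry a JAR-style manifest.
        if "AndroidManifest.xml" in names:
            return "apk"
        if "META-INF/MANIFEST.MF" in names:
            return "jar"
    return "zip"

def make_zip(members: dict[str, bytes]) -> io.BytesIO:
    """Build a small in-memory archive for experimenting with classify_zip."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data in members.items():
            zf.writestr(name, data)
    buf.seek(0)
    return buf
```

A result of `"zip"` only means none of these specific checks matched; formats such as ODT would need their own rule. For untrusted uploads, member sizes and counts should be bounded before anything is read or extracted.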

For RIFF formats, the first four bytes are shared too. WEBP, WAV, and AVI can all start with 52 49 46 46 (the ASCII text RIFF). The distinguishing field is the four-byte form type at offset 8: WEBP, WAVE, or AVI followed by a space.
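
A minimal sketch of the RIFF check, assuming the standard layout of a 4-byte `RIFF` tag, a 4-byte little-endian size, and then the 4-byte form type. The function name `riff_form` and the `"riff-unknown"` label are hypothetical.

```python
from typing import Optional

# Form type values at offset 8; note the trailing space in b"AVI ".
RIFF_FORMS = {b"WEBP": "webp", b"WAVE": "wav", b"AVI ": "avi"}

def riff_form(head: bytes) -> Optional[str]:
    """Return a label for a RIFF header, or None if the bytes are not RIFF.

    Layout: bytes 0-3 'RIFF', bytes 4-7 chunk size, bytes 8-11 form type.
    """
    if len(head) < 12 or head[:4] != b"RIFF":
        return None
    return RIFF_FORMS.get(head[8:12], "riff-unknown")

if __name__ == "__main__":
    # Synthetic 12-byte header; the size field value is arbitrary here.
    wav_head = b"RIFF" + (36).to_bytes(4, "little") + b"WAVE"
    print(riff_form(wav_head))
```

Twelve bytes are enough for this decision, so the check fits comfortably inside a fixed header window.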

What to do when there is no match

A no-match result does not always mean the file is random or corrupt. It can mean:

  • You copied too few bytes.
  • The format uses a signature at a later offset.
  • The file is encrypted, compressed, or wrapped by another container.
  • The format is not in the current database.
  • The file is plain text and has no reliable binary header.

If you are building an upload gate, no match should usually fail closed unless the user flow explicitly allows unknown files.
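
A fail-closed gate can be sketched as follows. The allowlist contents and the names `ALLOWED`, `candidates`, and `accept_upload` are hypothetical; the signatures are the three used in the script earlier in this article. A real gate would follow this with container inspection and format-specific validation.

```python
# Hypothetical allowlist for this sketch: only PNG and PDF uploads are wanted.
ALLOWED = frozenset({"png", "pdf"})

SIGNATURES = {
    "png": (0, bytes.fromhex("89 50 4E 47 0D 0A 1A 0A")),
    "pdf": (0, bytes.fromhex("25 50 44 46 2D")),
    "mp4": (4, bytes.fromhex("66 74 79 70")),
}

def candidates(head: bytes) -> set[str]:
    """Every label whose signature matches at its documented offset."""
    return {
        label
        for label, (offset, sig) in SIGNATURES.items()
        if head[offset:offset + len(sig)] == sig
    }

def accept_upload(head: bytes, allowed: frozenset[str] = ALLOWED) -> bool:
    """Fail closed: reject on no match, and reject any candidate off the allowlist."""
    found = candidates(head)
    return bool(found) and found <= allowed

if __name__ == "__main__":
    print(accept_upload(b"%PDF-1.7\n"))                 # known and allowed
    print(accept_upload(b"\x00\x00\x00\x18ftypisom"))   # known but not allowed
    print(accept_upload(b"just some text"))             # no match
```

Requiring every candidate to be on the allowlist, rather than any one of them, keeps ambiguous headers out by default.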

Practical workflow

For an unknown file:

  1. Capture 512 bytes from the start of the file.
  2. Paste them into the magic byte lookup, or send them to /api/v1/identify.
  3. Review every returned candidate, not only the first one.
  4. If the candidate is a shared container, inspect internal structure.
  5. If the file is untrusted, validate it against a strict allowlist before processing it.

Magic bytes are the first layer of identification. They tell you what the content appears to be, and they give you the right next parser or validation rule to use.

Frequently Asked Questions

Can I identify a file without its extension?

Yes. Read the first bytes of the file and compare them with known magic byte signatures. The extension is only a filename label; the signature is inside the file content.

How many bytes do I need to identify a file?

For many common formats, 4 to 16 bytes is enough. Some formats need more context because their signature starts at a nonzero offset or because they share a container signature such as ZIP.

Why does a lookup return many results for the same bytes?

Some formats share a container signature. DOCX, XLSX, JAR, EPUB, APK, and ZIP can all begin with 50 4B 03 04, so the header alone identifies the container family, not the exact document type.

Should I trust a browser or operating system file type label?

Treat it as a hint. Operating systems often infer file type from the extension or MIME metadata. For validation, inspect the bytes on the server.