How to Identify a File Without an Extension
When a file has no extension, a misleading extension, or a name like download with no suffix at all, identify it from the bytes. Most binary formats begin with a recognizable byte sequence called a magic number or file signature. That signature survives renaming, copying, and re-uploading, so it is a better starting point than the filename.
The short version:
- Read the first bytes of the file.
- Convert those bytes to hex.
- Compare the hex at the documented offset for each candidate format.
- If the signature is shared, inspect the container internals before making a final decision.
You can do the manual lookup with the magic byte lookup tool, or call /api/v1/identify when you already have header bytes in a script.
Read the header bytes
On macOS or Linux, xxd is usually the fastest way to inspect the start of a file:
xxd -l 32 unknown-file
The left column is the offset, the middle is hex, and the right column is a text preview. A PDF often starts like this:
00000000: 2550 4446 2d31 2e37 0a25 e2e3 cfd3 0a31 %PDF-1.7.%.....1
The first bytes are:
25 50 44 46 2D
Paste that into the lookup page and it resolves to PDF. The same value can be queried from code:
curl "https://filesignature.org/api/v1/identify?hex=25%2050%2044%2046%202D"
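The same two steps, reading the header and building the query, can be sketched in Python. This is a minimal illustration, not an official client: the endpoint and the hex parameter come from the curl example above, while the helper names header_hex and identify_url are made up for this sketch. It only constructs the URL and leaves the HTTP request to whatever client you already use.

```python
from pathlib import Path
from urllib.parse import quote

# Endpoint copied from the curl example; the helper names are illustrative.
IDENTIFY_URL = "https://filesignature.org/api/v1/identify"


def header_hex(path: str, length: int = 32) -> str:
    """Return the first `length` bytes of a file as spaced, uppercase hex."""
    with Path(path).open("rb") as handle:
        head = handle.read(length)  # reads only the header, not the whole file
    return head.hex(" ").upper()


def identify_url(hex_string: str) -> str:
    """Build the query URL, percent-encoding the spaces between byte pairs."""
    return f"{IDENTIFY_URL}?hex={quote(hex_string)}"
```

For a PDF, identify_url(header_hex("unknown-file", 5)) produces the same URL as the curl command above.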
Compare bytes at the right offset
Many signatures start at offset 0, but not all of them. Offset means the number of bytes from the beginning of the file to the first byte of the signature.
| Format | Signature | Offset | Note |
|---|---|---|---|
| PNG | 89 50 4E 47 0D 0A 1A 0A | 0 | Full 8-byte PNG header |
| PDF | 25 50 44 46 2D | 0 | The ASCII text %PDF- |
| MP4 | 66 74 79 70 | 4 | The ftyp box type starts after a 4-byte size field |
| DICOM | 44 49 43 4D | 128 | DICM follows a 128-byte preamble |
| TAR | 75 73 74 61 72 00 | 257 | TAR stores ustar in the header block |
Do not scan the entire file for a signature unless the format specifically defines that behavior. A byte sequence found somewhere in the body can be ordinary data, not a file type marker.
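To make the difference concrete, here is a small sketch that contrasts an offset check with a substring search, using the DICOM signature and offset from the table. The buffers are synthetic stand-ins rather than real files, and matches_at is a helper name invented for the example.

```python
def matches_at(data: bytes, offset: int, signature: bytes) -> bool:
    """True only if `signature` sits exactly at `offset` in `data`."""
    return data[offset:offset + len(signature)] == signature


DICM = bytes.fromhex("44 49 43 4D")

# Synthetic buffers for illustration, not real files.
dicom_like = bytes(128) + DICM + bytes(16)                   # marker at offset 128
text_with_marker = b"notes about the DICM tag" + bytes(200)  # same bytes, wrong place

print(matches_at(dicom_like, 128, DICM))        # True
print(DICM in text_with_marker)                 # True, but this is ordinary body data
print(matches_at(text_with_marker, 128, DICM))  # False
```

The substring search reports a hit for the second buffer even though nothing sits at offset 128, which is exactly the false positive the offset rule avoids.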
Use a small script when you need repeatability
For triage, read a fixed header window and compare slices:
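from pathlib import Path

# Each entry maps a label to (offset, signature bytes).
SIGNATURES = {
    "png": (0, bytes.fromhex("89 50 4E 47 0D 0A 1A 0A")),
    "pdf": (0, bytes.fromhex("25 50 44 46 2D")),
    "mp4": (4, bytes.fromhex("66 74 79 70")),
}

def identify(path: str) -> list[str]:
    # Read only the header window instead of loading the whole file.
    with Path(path).open("rb") as handle:
        head = handle.read(512)
    matches = []
    for label, (offset, sig) in SIGNATURES.items():
        # Compare the slice at the documented offset, never a substring search.
        if head[offset:offset + len(sig)] == sig:
            matches.append(label)
    return matches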
A 512-byte window covers most common header checks, including the offsets used by MP4 (4), DICOM (128), and TAR (257). It does not cover every possible offset in every format, and it cannot detect formats that are identified only by a trailer.
Resolve shared signatures
The most common trap is treating a shared container signature as a final answer.
50 4B 03 04 means "this looks like a ZIP container." It does not prove the exact user-facing format. The same bytes can identify ZIP, DOCX, XLSX, PPTX, APK, JAR, EPUB, ODT, and many other formats. To separate them:
- DOCX: open the ZIP and look for [Content_Types].xml plus a word/ directory.
- XLSX: look for [Content_Types].xml plus an xl/ directory.
- PPTX: look for [Content_Types].xml plus a ppt/ directory.
- EPUB: look for a mimetype file containing application/epub+zip.
- APK or JAR: inspect the manifest and expected archive paths.
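The DOCX, XLSX, PPTX, and EPUB checks can be sketched with Python's standard zipfile module. Treat this as a starting point under stated assumptions: classify_zip and its return labels are illustrative names, the 64-byte cap on the mimetype member is an arbitrary guard, and APK and JAR are left out because they need manifest inspection. Anything it cannot narrow down is reported as plain zip.

```python
import zipfile


def classify_zip(source) -> str:
    """Guess the format of a ZIP container from its member names.

    `source` is a path or a binary file-like object. The labels returned
    are illustrative, and the checks mirror only the list above.
    """
    with zipfile.ZipFile(source) as archive:
        names = archive.namelist()

        # EPUB: a small mimetype member whose content is application/epub+zip.
        if "mimetype" in names and archive.getinfo("mimetype").file_size <= 64:
            if archive.read("mimetype").strip() == b"application/epub+zip":
                return "epub"

        # Office Open XML: [Content_Types].xml plus a format-specific directory.
        if "[Content_Types].xml" in names:
            for prefix, label in (("word/", "docx"), ("xl/", "xlsx"), ("ppt/", "pptx")):
                if any(name.startswith(prefix) for name in names):
                    return label

    return "zip"
```

Checking the content of mimetype, not just its presence, matters because other ZIP-based formats such as ODT also carry a mimetype member with a different value.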
For RIFF formats, the first four bytes are shared too. WEBP, WAV, and AVI can all start with 52 49 46 46. The distinguishing field is the form type at offset 8.
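A short sketch of that check follows. It reads the four-byte form type at offset 8, after the RIFF tag and the 4-byte chunk size. The form type values WEBP, WAVE, and "AVI " (with a trailing space) are the standard ones for these three formats; the function name and return labels are invented for the example.

```python
from typing import Optional

# Form type at offset 8 mapped to an illustrative label.
RIFF_FORMS = {b"WEBP": "webp", b"WAVE": "wav", b"AVI ": "avi"}


def riff_form(head: bytes) -> Optional[str]:
    """Return a label for a RIFF header, or None if it is not RIFF at all."""
    if len(head) < 12 or head[:4] != b"RIFF":
        return None
    # Bytes 4..7 hold the chunk size; the form type follows at offset 8.
    return RIFF_FORMS.get(head[8:12], "riff (unknown form)")
```

Twelve header bytes are enough here, so this fits inside the same fixed header window used for the other checks.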
What to do when there is no match
A no-match result does not always mean the file is random or corrupt. It can mean:
- You copied too few bytes.
- The format uses a signature at a later offset.
- The file is encrypted, compressed, or wrapped by another container.
- The format is not in the current database.
- The file is plain text and has no reliable binary header.
If you are building an upload gate, no match should usually fail closed unless the user flow explicitly allows unknown files.
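One way to express the fail-closed rule in code is sketched below. It assumes you already have a list of candidate labels from a signature check; allow_upload, its parameters, and the labels are hypothetical names for this example, not part of any API.

```python
def allow_upload(matches: list[str], allowlist: set[str], allow_unknown: bool = False) -> bool:
    """Decide whether an upload may proceed, failing closed by default."""
    if not matches:
        # No signature matched: reject unless the flow explicitly opts in.
        return allow_unknown
    # Deliberately strict: every candidate must be on the allowlist.
    return all(label in allowlist for label in matches)
```

Because the check requires every candidate to be allowed, an ambiguous result such as a shared container signature is rejected until the container has been resolved to a single format.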
Practical workflow
For an unknown file:
- Capture 512 bytes from the start of the file.
- Paste them into the magic byte lookup, or send them to /api/v1/identify.
- Review every returned candidate, not only the first one.
- If the candidate is a shared container, inspect internal structure.
- If the file is untrusted, validate it against a strict allowlist before processing it.
Magic bytes are the first layer of identification. They tell you what the content appears to be, and they give you the right next parser or validation rule to use.
Frequently Asked Questions
Can I identify a file without its extension?
Yes. Read the first bytes of the file and compare them with known magic byte signatures. The extension is only a filename label; the signature is inside the file content.
How many bytes do I need to identify a file?
For many common formats, 4 to 16 bytes is enough. Some formats need more context because their signature starts at a nonzero offset or because they share a container signature such as ZIP.
Why does a lookup return many results for the same bytes?
Some formats share a container signature. DOCX, XLSX, JAR, EPUB, APK, and ZIP can all begin with 50 4B 03 04, so the header alone identifies the container family, not the exact document type.
Should I trust a browser or operating system file type label?
Treat it as a hint. Operating systems often infer file type from the extension or MIME metadata. For validation, inspect the bytes on the server.