Skip to content

A class library for discovering the format of files. Useful when receiving data from end-users when you cannot always rely on the content type specified (or trust it).

License

Workshell/file-formats

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

38 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Workshell.FileFormats

License NuGet Build Status

A class library for discovering the format of files. Useful when receiving data from end-users when you cannot always rely on the content type specified (or trust it).

The library is extensible, it can support further formats.

Installation

Stable builds are available as NuGet packages. You can install it via the Package Manager or via the Package Manager Console:

Install-Package Workshell.FileFormats

Usage

You can call FileFormat.Get(string) with a file name to attempt to disocver the format of a file, such as:

var format = FileFormat.Get(@"C:\Windows\explorer.exe");

You can also call FileFormat.Get(Stream) on a stream, such as:

var file = new FileStream(@"C:\Windows\explorer.exe", FileMode.Open, FileAccess.Read);
var format = FileFormat.Get(file);

Note that the Stream instance should support seeking as the stream will need to be rewound during the scanning process.

If a format could not be found then a simple null will be returned. If it is found then you can test the return type directly, such as:

if (format is PortableExecutableFormat)
{
    // Do something...
}

Formats can also take advantage of inheritance, for example the Office Open XML Word Document format has the following chain:

WordDocumentFormat -> OfficeZipFormat -> ZipFormat -> FileFormat

This means you can check further up the chain. For example:

if (format is ZipFormat)
{
    // Do something...
}

The above example would cover any file that's detected as being ZIP based, which includes Office Open XML and the Open Document Format.

Extending

In order to add support for a new format you first need to define a subclass of FileFormat which defines the content types and file extensions of matched data.

Take for example the PDF format:

public class PDFFormat : FileFormat
{
    private static readonly string[] _contentTypes => new[] 
    { 
        "application/pdf",
        "application/x-pdf"
    };
    private static readonly string[] _extensions => new[] { "pdf" };

    public PDFFormat() : base(_contentTypes, _extensions)
    {
    }

    public override int SortIndex => 10;
}

You will also note the SortIndex property. This is useful for files that count as two different types.

For example, Office Open XML file types are based on ZIP files, so they are both ZIP files and Office files. When scanning we want the OOXML format to rank higher than the ZIP format, so we give it a higher sort index.

Once you have your FileFormat class you then need to subclass FileFormatScanner which is used to perform the actual data scanning.

The FileFormatScanner class looks like this:

public abstract class FileFormatScanner
{
    public abstract FileFormat Match(FileFormatScanJob job);
}

It only has one method you need to override, FileFormatScanner.Match(FileFormatScanJob).

The FileFormatScanJob instance supplied to the method contains the first and last 4KB of the data being scanned and a stream. You use these to perform the analysis required to determine the format of the data.

The 7-Zip scanner for example looks like this:

public class SevenZipFormatScanner : FileFormatScanner
{
    private static readonly byte?[] Signature = new byte?[] { 0x37, 0x7A, 0xBC, 0xAF, 0x27, 0x1C };

    public SevenZipFormatScanner()
    {
    }

    public override FileFormat Match(FileFormatScanJob job)
    {
        if (FileFormatUtils.IsNullOrEmpty(job.StartBytes))
            return null;

        if (job.StartBytes.Length <= Signature.Length)
            return null;

        if (!FileFormatUtils.MatchBytes(job.StartBytes, Signature))
            return null;

        var fingerprint = new SevenZipFormat();

        return fingerprint;
    }
}

Take a look at the existing formats for reference implementations.

Recognised File Formats

Name Example Extension(s)
Archive Formats 7-Zip .7z
BZip .bz2
Cabinet .cab
GZip .gz
RAR .rar
Zip .zip
.zipx
Containers Java Archive .jar
NuGet Package .nupkg
Microsoft Installer Database *.msi
eBooks ePub .epub
Amazon/MobiPocket eBook .mobi
Executables Executable and Linkable Format (Linux etc) .axf
.bin
.elf
.o
.prx
.puff
.ko
.mod
.so
Mach-O (macOS, iOS etc) .o
.dylib
.bundle
Portable Executable (Windows, .NET) .exe
.dll
.ocx
.scr
.cpl
.sys
Graphics Bitmap .bmp
Graphics Interchange Format .gif
JPEG .jpg
.jpeg
Portable Network Graphic .png
Tagged Image File Format .tif
.tiff
Media 3GP .3gp
3GP2 .3g2
Adaptive Multi-Rate Audio .amr
.3ga
Advanced Systems Format (Windows Media) .asf
.wmv
.wma
Audio Video Interleaved .avi
Audio Interchange File Format .aiff
.aif
.aifc
Basic Audio .au
Dolby AC-3 .ac3
Flash Video .flv
Free Lossless Audio Codec .flac
MPEG-4 Part 14 (MP4) .mp4
Matroska .mkv
.mka
.mks
.mk3d
M4V .m4v
M4A .m4a
Ogg .ogg
.ogv
.oga
.ogx
.ogm
.spx
.opus
QuickTime .mov
.qt
WebM .webm
Wave Audio .wav
RealAudio .rm
.ram
Microsoft Office Access .mdb
.accdb
Excel Workbook, Template or Add-In .xls
.xlt
xla
Outlook Personal Storage Table .pst
Outlook Message *.msg
PowerPoint Presentation, Template, Slideshow or Add-In .ppt
.pot
.pps
.ppa
Publisher .pub
Visio Drawing, Template or Stencil .vsd
.vst
.vss
Word Document or Template .doc
.dot
Microsoft Office (OpenXML) Excel Add-In .xlam
Excel Binary Workbook .xlsb
Excel Workbook .xlsx
.xlsm
Excel Workbook Template .xtlx
.xltm
PowerPoint Add-In .ppam
PowerPoint Presentation .pptx
.pptm
PowerPoint Presentation Template .potx
.potm
PowerPoint Slideshow .ppsx
.ppsm
Visio Drawing .vsdx
.vsdm
Visio Drawing Template .vstx
.vstm
Visio Stencil .vssx
.vssm
Word Document .docx
.docm
Word Document Template .dotx
.dotm
OpenOffice Chart .odc
Chart Template .otc
Database .odb
Document .odt
Document Template .ott
Drawing .odg
Drawing Template .otg
Formula .odf
Formula Template .otf
Image .odi
Image Template .oti
Master Document .odm
Presentation .odp
Presentation Template .otp
Spreadsheet .ods
Spreadsheet Template .odt
OpenOffice (Flat) Document .fodt
Drawing .fodg
Presentation .fodp
Spreadsheet .fods
Unified Office Format Document .uot
.uof
Presentation .uop
.uof
Spreadsheet .uos
.uof
Others Animated Cursor .ani
Flash .swf
Portable Document Format .pdf
eXtensible Markup Language (XML) .xml

If you think there's a common format we should cover then please do let us know and we'll try and add support for it.

Attribution

We currently use a modified and internalised variant of OpenMCDF for reading some file formats, especially legacy Microsoft Office files.

The original version of OpenMCDF is available here: https://github.com/ironfede/openmcdf

The code remains copyright and licensed to those respective authors.

MIT License

Copyright (c) Workshell Ltd

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

About

A class library for discovering the format of files. Useful when receiving data from end-users when you cannot always rely on the content type specified (or trust it).

Resources

License

Stars

Watchers

Forks

Packages

No packages published