GitHub - SergiyStoyan/PdfDocumentParser: PdfDocumentParser is a .NET toolset for building PDF parsers.

PdfDocumentParser

PdfDocumentParser is a parsing engine intended to find and extract text/images from PDF documents that conform to predictable graphic layouts, such as reports, bills, forms, tickets and the like. Its parsing approach is based on finding certain text or image fragments on a page and then extracting the text/images located relative to those fragments.
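The anchor-relative idea described above can be illustrated with a minimal, self-contained sketch. The `Word` and `AnchorSketch` types, the method name and its parameters are all hypothetical and exist only for this illustration; they are not the PdfDocumentParser API, which is built around templates created in its GUI. The sketch assumes the page has already been reduced to words with bounding boxes, with the origin at the top-left corner of the page.

```csharp
// Illustrative sketch only: every type and member here is hypothetical
// and is NOT part of the PdfDocumentParser API.
using System;
using System.Collections.Generic;
using System.Linq;

// A word found on a page with its bounding box (origin at the page's top-left corner).
public sealed class Word
{
    public string Text { get; }
    public double X { get; }
    public double Y { get; }
    public double Width { get; }
    public double Height { get; }

    public Word(string text, double x, double y, double width, double height)
    {
        Text = text; X = x; Y = y; Width = width; Height = height;
    }
}

public static class AnchorSketch
{
    // Finds the first word equal to anchorText, builds a rectangle offset from
    // that word's top-left corner, and joins the words lying fully inside it.
    // Returns false (and an empty string) when the anchor is not on the page.
    public static bool TryExtractRelativeToAnchor(
        IEnumerable<Word> page, string anchorText,
        double offsetX, double offsetY, double width, double height,
        out string text)
    {
        List<Word> words = page.ToList();
        Word anchor = words.FirstOrDefault(w => w.Text == anchorText);
        if (anchor == null)
        {
            text = string.Empty;
            return false;
        }

        // The field rectangle is defined relative to the anchor, not to the page.
        double left = anchor.X + offsetX;
        double top = anchor.Y + offsetY;
        double right = left + width;
        double bottom = top + height;

        // Keep the words fully inside the rectangle, in reading order.
        IEnumerable<string> inside = words
            .Where(w => w.X >= left && w.X + w.Width <= right
                     && w.Y >= top && w.Y + w.Height <= bottom)
            .OrderBy(w => w.Y).ThenBy(w => w.X)
            .Select(w => w.Text);

        text = string.Join(" ", inside);
        return true;
    }
}
```

Because the rectangle is positioned relative to the anchor rather than to the page, the same definition still finds the value when the whole block shifts between documents, for example when a "Total:" label sits lower on a longer bill.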

PdfDocumentParser handles the tricky work of building parsing templates, searching, recognition and extraction, leaving you to code only your custom logic.

PdfDocumentParser is a .NET DLL.

For a usage example that can also serve as a starting framework, refer to the SampleParser project in the repository.

Known issues

Because the GUI is built with WinForms, it may appear mangled on a UHD display (or, depending on the version, on an FHD display instead). Don't worry: you can open it in Visual Studio and tune it for your resolution. A WPF version exists, but its development has stalled.

Documentation

Here

Support

Contact me if you want me to enhance PdfDocumentParser. You can also hire me to solve a parsing task of any complexity or for general development.

