GitHub - miguelbandera/PdfDocumentParser: PdfDocumentParser is a .NET toolset for building PDF parsers.

PdfDocumentParser

PdfDocumentParser is a parsing engine that finds and extracts text and images from PDF documents with a predictable graphic layout, such as reports, bills, forms, tickets and the like. It works by finding certain text or image fragments (anchors) on a page and then extracting the text or images located relative to those fragments.
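The anchor-relative idea described above can be illustrated with a minimal, self-contained C# sketch. It is not PdfDocumentParser's API: the types `TextBox` and `RelativeField` and the class `AnchorExtraction` are hypothetical names invented for this illustration, the page is assumed to be already available as a flat list of text boxes with coordinates, and anchor matching is simplified to exact text equality. The sketch assumes C# 9 or later (for records).

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical types for illustration only; they are NOT part of PdfDocumentParser.

// A piece of text on a page with its bounding box (origin at top-left, Y grows downward).
public record TextBox(string Text, float X, float Y, float Width, float Height);

// A rectangle whose position is given relative to the anchor's top-left corner.
public record RelativeField(float OffsetX, float OffsetY, float Width, float Height);

public static class AnchorExtraction
{
    // Returns the first box whose text equals anchorText, or null if there is none.
    public static TextBox? FindAnchor(IEnumerable<TextBox> page, string anchorText) =>
        page.FirstOrDefault(b => b.Text == anchorText);

    // Places the field rectangle relative to the anchor and joins, in reading order,
    // the text of every box that lies entirely inside that rectangle.
    // Returns null when the anchor is not found, i.e. the page does not match the layout.
    public static string? ExtractField(
        IReadOnlyList<TextBox> page, string anchorText, RelativeField field)
    {
        TextBox? anchor = FindAnchor(page, anchorText);
        if (anchor is null)
            return null;

        float left = anchor.X + field.OffsetX;
        float top = anchor.Y + field.OffsetY;
        float right = left + field.Width;
        float bottom = top + field.Height;

        IEnumerable<string> inside = page
            .Where(b => b.X >= left && b.Y >= top
                     && b.X + b.Width <= right && b.Y + b.Height <= bottom)
            .OrderBy(b => b.Y).ThenBy(b => b.X)
            .Select(b => b.Text);

        return string.Join(" ", inside);
    }
}
```

For example, given an anchor box "Invoice No:" at (50, 100) and a field defined as `new RelativeField(70, 0, 100, 12)`, a box "12345" at (120, 100) with size 40 x 12 falls inside the field rectangle and is returned. Because the field is positioned relative to the anchor rather than to the page, the same definition keeps working when the whole block shifts from one document to the next.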

PdfDocumentParser handles the tricky work of building parsing templates, searching, recognition and extraction, leaving you to code only your custom logic.

PdfDocumentParser is a .NET DLL.

For an example of how to use PdfDocumentParser, or for a framework to build on, refer to the SampleParser project in the repository.

Known issues

Because it is a WinForms application, the GUI may appear mangled on a UHD display (or, depending on the version, on an FHD one). Don't be afraid: you can open it in Visual Studio and tune it for your resolution. Development of the WPF version is currently frozen.

Documentation

Support

Contact me if you want me to enhance PdfDocumentParser. You can also hire me to solve a parsing task of any complexity or for general development.

Folders and files

CliverRoutines
CliverWinRoutines
Properties
SampleParser
Settings
docs
docs_files
externals
.gitattributes
.gitignore
3RINGS~1.ICO
AboutBox.Designer.cs
AboutBox.cs
AboutBox.resx
AnchorControl.Designer.cs
AnchorControl.cs
AnchorCvImageControl.Designer.cs
AnchorCvImageControl.cs
AnchorCvImageControl.resx
AnchorImageDataControl.Designer.cs
AnchorImageDataControl.cs
AnchorImageDataControl.resx
AnchorOcrTextControl.Designer.cs
AnchorOcrTextControl.cs
AnchorOcrTextControl.resx
AnchorPdfTextControl.Designer.cs
AnchorPdfTextControl.cs
AnchorPdfTextControl.resx
AnchorScriptControl.Designer.cs
AnchorScriptControl.cs
AnchorScriptControl.resx
BitmapPreprocessor.cs
BooleanEngine.cs
CvImage.cs
Deskewer.cs
ImageData.cs
LICENSE
Ocr.cs
Ocr.tesseract.4.cs
Page.anchors.cs
Page.conditions.cs
Page.cs
Page.fields.cs
Page.text.cs
PageCollection.cs
Pdf.cs
PdfDocumentParser.csproj
PdfDocumentParser.sln
Program.cs
README.md
ScanTemplateForm.Designer.cs
ScanTemplateForm.cs
ScanTemplateForm.resx
SettingsForm.Designer.cs
SettingsForm.cs
SettingsForm.resx
TableRowControl.Designer.cs
TableRowControl.cs
TableRowControl.resx
Template.Anchor.cs
Template.Field.cs
Template.cs
Template.scan.cs
TemplateForm.Designer.cs
TemplateForm.TemplateManager.cs
TemplateForm.anchors.cs
TemplateForm.conditions.cs
TemplateForm.cs
TemplateForm.extention.cs
TemplateForm.fields.cs
TemplateForm.pages.cs
TemplateForm.resx
TextForm.Designer.cs
TextForm.cs
TextForm.resx
_config.yml
app.config
app.manifest
computers308.ico

License
