Skip to content

modulexcite/DataProfiler

 
 

Repository files navigation

DataProfiler

Data Profiler is an open source .NET based command line tool (and library) for profiling delimited files, Excel (the 1st sheet), and database tables. It uses Transformalize and Junk Drawer to do most of it's work.

It is released under GNU General Public License, version 3 (GPL-3.0).

The command line tool is dp.exe. It takes 1 to 3 arguments:


####Argument 1, Input: Input should be a file name or fully qualified database table name.

file:

dp c:\temp\fantasy\fantasy2013.txt
FieldTypeIndexDistinct CountMin ValueMax ValueMin LengthMax Length
Rankint161598159813
Namestring2597A.J. Green*Zach Sudfeld823
Teamstring3352TMWAS33
Agebyte418213822
Gamesbyte51611612
GamesStartedbyte61701612
PassingCmpint16758045013
PassingAttint16861065913
PassingYdsint169710547714
PassingTDbyte103005512
PassingIntbyte112102712
RushingAttint1612109031413
RushingYdsint1613174-31160714
RushingYAsingle14198-94515
RushingTDbyte151301212
ReceivingRecbyte1692011313
ReceivingYdsint1617304-9164614
ReceivingYRsingle18334-55715
ReceivingTDbyte191601612
FantPosstring204QBWR22
FantPointsint1621176-341013
VBDbyte2256018213
PosRankbyte23222122313
OvRankbyte247907812

table:

dp localhost.NorthWind.dbo.Customers
FieldTypeIndexDistinct CountMin ValueMax ValueMin LengthMax Length
CustomerIDstring191ALFKIWOLZA55
CompanyNamestring291Alfreds FutterkisteWolski Zajazd836
ContactNamestring391Alejandra CaminoZbyszek Piestrzeniewicz823
ContactTitlestring412Accounting ManagerSales Representative530
Addressstring5911 rue Alsace-LorraineWalserweg 211146
Citystring669AachenWarszawa415
Regionstring719WY013
PostalCodestring887WX3 6FW09
Countrystring921ArgentinaVenezuela211
Phonestring1091(02) 201 24 67981-443655817
Faxstring1170981-443655017
RowVersionbyte[]1291System.Byte[]System.Byte[]1313

The fully qualified database table name must indicate the Server, the Database, the Schema, and the Table name. It must be delimited by dots. For example, Server.Database.Schema.Table is correct. When your server name contains dots, surround it with brackets (e.g. [Server.com]). If your provider doesn't support schema, use Server.Database.Table.

By default, database objects assume a provider of sqlserver. You may configure it to expect mysql, or postgresql as well. If you're not using SQL Server with trusted authentication, you may configure ports, user names, and passwords as needed in dp.exe.config.


####Argument 2, Output:

Output is optional. If you don't provide it, it will be whatever is defined as output in your DataProfiler process (see dp.exe.config). It is usually data-profile.html. If your output file has an .html extension, it is rendered and opened as an HTML table with Bootstrap styling. You may also choose a .txt or .csv extension and it will open in NotePad or Excel respectively.


####Argument 3, Sample Percentage:

Sample Percentage is optional. If you don't need a perfect data profile, use an integer between 1 and 99 to sample a percentage of the data. This can really speed things up on bigger files (at the cost of accuracy though).


####dp.exe.config

<transformalize>
	<processes>
		<!-- JunkDrawer process is for profiling files -->
		<add name="JunkDrawer">
			<connections>
				<add name="input" provider="file" file="*"/>
				<add name="output" provider="internal"/>
			</connections>
		</add>
		<add name="DataProfiler">
			<connections>
				<!-- by default, provider is sqlserver -->
				<add name="input" />

				<!-- for other database providers -->
				<!-- <add name="input" provider="mysql" port="*" user="*" password="*" /> -->
				<!-- <add name="input" provider="postgresql" port="*" user="*" password="*" /> -->

				<!-- default output is data-profile.html, in the same folder your dp.exe is in -->
				<add name="output" provider="html" file="data-profile.html" />
			</connections>
			<actions>
				<!-- remove this if you don't want data-profile.html to open after profile is complete -->
				<add action="open" connection="output" />
			</actions>
			<entities>
				<!-- you can control the default sample size here, and whether or not nolock hint is used -->
				<add name="*" sample="100" no-lock="true" />
			</entities>
		</add>
	</processes>
</transformalize>

About

Data Profiler (for data, not code)

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • C# 100.0%