From a data set containing the columns
Change,Country,Location,Name,NameWoDiacritics,Subdivision,Status,Function,Date,IATA,Coordinates,Remarks
and mappings for
CountryCode,CountryName
FunctionCode,FunctionDescription
where Country_Code = Country, Country_Name = CountryName (looked up by Country), Location_Code = Location, Location_Name = Name, Location_Type = FunctionDescription (looked up by Function), Latitude = Extract(Coordinates), Longitude = Extract(Coordinates),
produce output in the following format
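
The field mapping above, independent of the final output layout, can be sketched in a few lines of Python. This is only an illustration under assumptions, not the application's actual code: the lookup entries are hypothetical samples, the Coordinates value is assumed to look like `DDMM[N|S] DDDMM[E|W]` (for example `5130N 00430E`), and the Function value is assumed to be usable directly as a lookup key.

```python
import re

# Hypothetical sample entries standing in for the CountryCode/CountryName
# and FunctionCode/FunctionDescription mapping files.
COUNTRY_NAMES = {"NL": "Netherlands"}
FUNCTION_DESCRIPTIONS = {"1": "Port"}

# Assumed coordinate layout: "DDMM[N|S] DDDMM[E|W]", e.g. "5130N 00430E".
COORD_RE = re.compile(r"^(\d{2})(\d{2})([NS])\s+(\d{3})(\d{2})([EW])$")


def extract_coordinates(raw):
    """Return (latitude, longitude) in decimal degrees, or (None, None)."""
    match = COORD_RE.match(raw.strip())
    if match is None:
        return None, None
    lat_deg, lat_min, ns, lon_deg, lon_min, ew = match.groups()
    lat = int(lat_deg) + int(lat_min) / 60
    lon = int(lon_deg) + int(lon_min) / 60
    # South and West are represented as negative values.
    return (-lat if ns == "S" else lat), (-lon if ew == "W" else lon)


def project_row(row):
    """Map one input row (a dict keyed by column name) to the output fields."""
    lat, lon = extract_coordinates(row.get("Coordinates", ""))
    return {
        "Country_Code": row["Country"],
        "Country_Name": COUNTRY_NAMES.get(row["Country"], ""),
        "Location_Code": row["Location"],
        "Location_Name": row["Name"],
        # Assumption: Function is a single lookup key. If the column packs
        # several codes into one value, it would need splitting first.
        "Location_Type": FUNCTION_DESCRIPTIONS.get(row["Function"], ""),
        "Latitude": lat,
        "Longitude": lon,
    }
```

Rows whose coordinates do not match the assumed layout keep empty Latitude and Longitude values rather than failing, which leaves the decision about such rows to a later cleaning step.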
~ Note: I have opted for developing a microservice.
- Windows or Linux, capable of running .NET Core applications
- .NET Core 2.1
Design Documentation - tells you how it works under the hood
Getting started - tells you how to use this application to perform ETL
- SQL
  - Export the CSV and lookup information to a SQL database
  - Iteratively clean the CSV data with basic tools
  - Write queries to create projections
  - Export the projections to TSV or the expected output format
  - Advantages
    - Easy to work with the data (for small data sets)
    - Easy to understand; relies on general tool knowledge
    - Minimum time to reach a solution
  - Disadvantages
    - Not modular: tight coupling caused by scripts
    - Not scalable: limited by data size
    - Dependency on SQL
    - Manual intervention is needed to progress through the steps
    - Cannot easily be packaged and deployed to environments
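
As a rough sketch of the SQL route described above, the snippet below loads CSV text into an in-memory SQLite database, joins the locations with the country lookup, and exports the projection as TSV. SQLite, the table names, and the sample rows are assumptions chosen to keep the example self-contained; the notes above do not say which SQL database was considered.

```python
import csv
import io
import sqlite3

# Hypothetical sample data; real input would be read from the CSV files.
LOCATIONS_CSV = "Country,Location,Name\nNL,RTM,Rotterdam\nXX,ABC,Nowhere\n"
COUNTRIES_CSV = "CountryCode,CountryName\nNL,Netherlands\n"


def load_csv(conn, table, text):
    """Create `table` from CSV text, one TEXT column per header field."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', data)


def export_projection(conn):
    """Join locations with the country lookup and return the result as TSV."""
    cursor = conn.execute(
        """
        SELECT l.Country     AS Country_Code,
               c.CountryName AS Country_Name,
               l.Location    AS Location_Code,
               l.Name        AS Location_Name
        FROM locations AS l
        JOIN countries AS c ON c.CountryCode = l.Country
        ORDER BY l.Country, l.Location
        """
    )
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow([col[0] for col in cursor.description])  # header row
    writer.writerows(cursor.fetchall())
    return out.getvalue()


conn = sqlite3.connect(":memory:")
load_csv(conn, "locations", LOCATIONS_CSV)
load_csv(conn, "countries", COUNTRIES_CSV)
tsv = export_projection(conn)
```

The inner join silently drops the row whose country code has no lookup entry. That is the kind of case where the manual, iterative cleaning listed under the disadvantages comes in.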
- Python tools
  - Export the CSV and lookup information to a SQL database
  - Use Python tools like csvkit and SQLAlchemy
  - Omitted due to personal preference and the lack of an available machine for installing the tools
- Developing a tool / microservice
  - Build a software tool to read the incoming CSV
  - Take runtime configuration
  - Auto-validate entries
  - If faulty, auto-apply fixes (fixing data by inference)
  - Perform data cleansing
  - Create projections as specified in the output config
  - By using automatic dependency discovery and preconfigurations
  - Perform a final checksum of the data
  - Advantages
    - Extensibility: can be made to source from and sink to any format
    - Modularity and reusability: source-to-sink transformations can be reused for other datasets and services
    - Can easily be made into a general service
    - Performance and efficient use of resources
    - No dependency on databases
  - Disadvantages
    - Huge effort and time to build the framework
    - Maintenance of the tool and of the related knowledge about it
    - Learning curve of a new tool
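
The validate, fix, project, and checksum steps of the tool approach can be sketched as a small pipeline. This is written in Python for brevity, whereas the application itself targets .NET Core, so it shows the shape of the idea rather than the actual implementation. The `Pipeline` class, the rule functions, and the sample rows are all hypothetical, and a SHA-256 digest over the output rows is only one possible reading of the "final checksum" step.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List

Row = Dict[str, str]


@dataclass
class Pipeline:
    """Runs rows through validate -> fix -> project (hypothetical design)."""
    is_valid: Callable[[Row], bool]
    fix: Callable[[Row], Row]
    project: Callable[[Row], Row]
    rejected: List[Row] = field(default_factory=list)

    def run(self, source: Iterable[Row]) -> List[Row]:
        output = []
        for row in source:
            if not self.is_valid(row):
                row = self.fix(row)  # try to repair the row by inference
                if not self.is_valid(row):
                    self.rejected.append(row)  # still faulty: set it aside
                    continue
            output.append(self.project(row))
        return output


def checksum(rows: Iterable[Row]) -> str:
    """SHA-256 over the rows in order, so two runs can be compared."""
    digest = hashlib.sha256()
    for row in rows:
        line = "\t".join(f"{key}={row[key]}" for key in sorted(row))
        digest.update(line.encode("utf-8") + b"\n")
    return digest.hexdigest()


# Example rules: a row needs a Name; a missing Name is inferred from
# NameWoDiacritics when that column is populated.
def has_name(row: Row) -> bool:
    return bool(row.get("Name", "").strip())


def infer_name(row: Row) -> Row:
    return {**row, "Name": row.get("NameWoDiacritics", "")}


def to_output(row: Row) -> Row:
    return {"Location_Code": row["Location"], "Location_Name": row["Name"]}


pipeline = Pipeline(is_valid=has_name, fix=infer_name, project=to_output)
rows = [
    {"Location": "RTM", "Name": "Rotterdam", "NameWoDiacritics": "Rotterdam"},
    {"Location": "ZRH", "Name": "", "NameWoDiacritics": "Zurich"},
    {"Location": "ABC", "Name": "", "NameWoDiacritics": ""},
]
result = pipeline.run(rows)
```

Because the source is any iterable of rows and each stage is a plain function, the same pipeline can be pointed at a different dataset by swapping the three callables, which is the reuse argument made in the advantages list.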