# TextPipe

Clone and publish:

```bash
git clone https://github.com/boydc7/textpipe
cd textpipe
dotnet publish -c Release -o ../../publish src/TextWrangler.Console/TextWrangler.Console.csproj
```

Run the test suites (from the textpipe folder above):

```bash
dotnet test tests/TextWrangler.UnitTests/TextWrangler.UnitTests.csproj
dotnet test tests/TextWrangler.IntegrationTests/TextWrangler.IntegrationTests.csproj
```

Run (from the textpipe folder above):

NOTE: The sample.csv file is a copy of the sample Gist provided with the take-home doc. The recordSample record type in the textwrangler.json config file can be used to map any file with similar data to the target record model requested in the take-home doc.

```bash
# Show usage:
dotnet publish/wrangle.dll

# Wrangle sample.csv displaying output in log (console):
dotnet publish/wrangle.dll recordSample publish/sample.csv publish/textwrangler.json

# Wrangle sample.csv sending output to sample_out.csv file:
dotnet publish/wrangle.dll recordSample publish/sample.csv publish/textwrangler.json 0 publish/sample_out.csv

# Wrangle large-ish sales csv sending output to large_sales_out.csv file.
# Download and extract the large sales file from:
# http://eforexcel.com/wp/wp-content/uploads/2017/07/1500000%20Sales%20Records.zip
dotnet publish/wrangle.dll salesSample "publish/1500000 Sales Records.csv" publish/textwrangler.json 0 publish/large_sales_out.csv
```

This is basically a simple 5-component ETL pipeline. Each of the 5 component dependencies can be injected into the default ITextWrangler implementation to adjust the runtime behaviour. To create and run a simple pipeline, initialize an instance of TextWrangler (or any concrete implementation of ITextWrangler) and pass the appropriate args, like this:

```csharp
using(var wrangler = new TextWrangler(recordConfigName,
                                      new CsvRecordReader(fileName)))
{
    wrangler.Wrangle(limit);
}
```

The general flow of the data through the pipeline is as follows:

```
IRecordReader -> IRecordBuilder -> IRecordFormatter -> IRecordFilterService -> IRecordWriter
```

The actual pipeline is a simple lazily produced enumerable of entities that flow from one component to the next; each component processes each entity in turn and yields it as output to the following one.
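As a rough illustration of that lazy flow (a generic sketch, not the actual TextWrangler types):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Each stage takes an IEnumerable and lazily yields its output, so every
// record is pulled through the whole pipeline one at a time -- no stage
// materializes the full file in memory.
static IEnumerable<string[]> Read(string path)
    => File.ReadLines(path).Select(line => line.Split(','));

static IEnumerable<string[]> Format(IEnumerable<string[]> records)
    => records.Select(r => r.Select(f => f.Trim()).ToArray());

static void Write(IEnumerable<string[]> records, TextWriter output)
{
    foreach (var record in records)   // enumerating here drives all upstream stages
        output.WriteLine(string.Join("|", record));
}

// Nothing is read from disk until Write starts enumerating:
Write(Format(Read("sample.csv")), Console.Out);
```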

## IRecordReader

Responsible for reading source records and producing a map of labeled source values (labeled by field name or index, depending on the abilities of the source). A CSV source, for example, may have a header that allows labeling the data by field name, or it may not (or may have header values for only some of the fields), in which case the data is labeled by field index. CsvRecordReader (used in the example above) is one of the included concrete implementations.
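For instance, a headered and a headerless CSV row might produce source maps like these (the labeling scheme shown is illustrative only; see the reader implementations for the exact behaviour):

```csharp
using System.Collections.Generic;

// Fully headered CSV row: every value is labeled by its header name.
var byName = new Dictionary<string, string>
{
    ["Order Num"] = "10-0042",
    ["Amt"]       = "19.95"
};

// Headerless CSV row: values fall back to being labeled by column index.
var byIndex = new Dictionary<string, string>
{
    ["0"] = "10-0042",
    ["1"] = "19.95"
};
```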

## IRecordBuilder

Responsible for turning IRecordReader source maps into initial representations of target record models, optionally filtering the source values with the injected IFieldFilterService (when the source configuration for a given field includes filters to apply to the source value; see the config file reference). A single concrete implementation is included.

## IRecordFormatter

Responsible for manipulating target field values in some specific way, to format output, prepare a value for other formatters to work on, etc. Concrete implementations are included in the source.

## IRecordFilterService

Responsible for passing target values through zero or more filters that are mapped to the target field in the config file (see the config file reference). Some filters manipulate the data (i.e. trim, uppercase, titlecase, etc.) and some simply validate that the data matches a certain restriction, throwing an exception if it does not (i.e. the alpha filter, alphanumeric filter, etc.). The service is not a filter itself; it simply uses IFieldFilter implementations to operate on a given field if the field requires it. To see the various included IFieldFilter implementations, see the Filters folder.
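A minimal sketch of the two filter flavours (the shape of IFieldFilter here is assumed for illustration; the real contract lives in the Filters folder):

```csharp
using System;
using System.Linq;

// Assumed single-method contract over string field values (illustrative only).
public interface IFieldFilter
{
    string Filter(string value);
}

// A transforming filter: changes the value and never throws.
public class TrimFilter : IFieldFilter
{
    public string Filter(string value) => value?.Trim();
}

// A validating filter: returns the value untouched but throws on bad data.
public class AlphaFilter : IFieldFilter
{
    public string Filter(string value)
        => value.All(char.IsLetter)
               ? value
               : throw new FormatException($"Value [{value}] is not purely alphabetic");
}
```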

## IRecordWriter

Responsible for writing final target records to a destination. Concrete implementations are included in the source.

Beyond the documentation in this README, there is a fairly large amount of XML documentation on the various types and methods in the source code for reference.

See the config file reference section below for an overview of the config file format, requirements, samples, etc.

## Potential improvements

- From a usability perspective, I'd consider removing the current requirement that every source field used by a target field also appear in the sources list in the config. You'd still have to include a source field if you want to filter it in some way (as you'd have to indicate the filter(s) to run it through), but you shouldn't have to map it there just to grab its raw value and use it.
- Currently, logging/output streams (i.e. error records, log data, etc.) have no context from the different stages in the pipeline to include with output. For example, if a filter throws an exception, it does not currently know what the source values were, which record it is, etc. It would be pretty simple to provide some thread/async-local storage through the pipeline to give that context (see the sketch after this list), which would allow for much "better" stream output.
- From an operational perspective, I'd look at adding significant logging/profiling options to allow for easier troubleshooting, tracking, etc.
- I'd certainly consider extending the integration-testing footprint specifically; unit testing is fairly well covered, but could also likely be extended a bit.
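A minimal sketch of the async-local context idea from the second bullet (all names here are hypothetical; nothing like this exists in the codebase yet):

```csharp
using System;
using System.Threading;

// Hypothetical ambient context that each pipeline stage could stamp as a
// record flows through, so downstream logging/error handling can read it.
public class RecordContext
{
    public long RecordNumber { get; set; }
    public string SourceValues { get; set; }
}

public static class WrangleContext
{
    private static readonly AsyncLocal<RecordContext> _current = new AsyncLocal<RecordContext>();

    public static RecordContext Current
    {
        get => _current.Value ??= new RecordContext();
        set => _current.Value = value;
    }
}

// Usage inside a hypothetical stage:
//     WrangleContext.Current.RecordNumber = recordNumber;
// ...and inside a filter's exception handler:
//     Console.Error.WriteLine($"Filter failed on record {WrangleContext.Current.RecordNumber}");
```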

## Config file reference

The TextWrangler config file is a JSON-formatted file responsible for defining one or more target record types, each of which maps one or more source fields to one or more target fields within the record.

A simple sample config file that represents most (all?) of the functionality is as follows:

```jsonc
{
    "orderRecord": {                    // A mapping of an OrderRecord
      "fields": [                       // One or more fields that make up the order
        {
            "name": "OrderId",          // Target field named OrderId (Target fields are outputs)  
            "format": "<Order Num>",    // The source field(s) and format specifier used to create and
                                        // format the target field
            "type": "System.Int64",     // An optional type that the final target field value MUST 
                                        // convert into successfully
            "sources": [                // Zero or more source field(s) that are used to build this 
                                        // target field
              {         
                "name": "Order Num",    // The name of the source field. To reference a source field 
                                        // by name, it must come from a file format that supports 
                                        // named fields, i.e. a CSV file with valid headers, JSON,...
                "filters": [            // One or more filters to apply to the source field before using
                  "trim",               // These modify the source value in some way, and do not 
                  "upper",              // fail/throw exceptions based on the contents of the field
                                        // a value is simply trimmed, or upper-cased (if upper-
                  "titlecase"           // caseable) or title-cased...
                ]
              }
            ]
          },
          {
            "name": "OrderAmount",      // 2nd target field named OrderAmount
            "format": "{<Amt>,0:N4}",   // Formats can include any viable .NET format string (this
                                        // one will format the source field Amt value from a double 
                                        // into a number with culture-specific thousands separators
                                        // and 4 decimal places (rounded)
            "sources": [
              {
                "name": "Amt",          // The source field 
                "type": "System.Double"  // If the source field is used in a format string, the 
              }                         // type is required to make it work contextually correctly
            ]
          },
          {
            "name": "OrderDate",        // The 3rd target field named OrderDate
            "format": "{<Year>:2000}-{<Month>:00}-{<Day>:00}T00:00:00Z",  
                      // Above is an ISO 8601 format date with a Zulu tz, built from 3 source fields.
                      // The Year will be a 4-digit year (if 2 digits it will be 20xx); month and
                      // day will be 2-digit values padded with a 0 on the left
            "sources": [
              {                          // Single target made of 3 different source fields
                "name": "Year",
                "type": "System.Int32"   // Each with a type specified to work contextually in 
              },                         // the format string
              {
                "name": "Month",
                "type": "System.Int32"
              },
              {
                "name": "Day",
                "type": "System.Int32"
              }
            ]
          },
          {
            "name": "Unit",               // 4th target field
            "format": "miles"             // Just a static value
          },
          {
            "name": "ItemName",           // 5th target field
            "format": "<Item ID>",        // Made from a single source field
            "sources": [
              {
                "name": "Item ID",
                "filters": [
                  "alphanumeric",         // Some filters are validators that do not change the field but verify
                  "alpha"                 // they contain valid data and throw an excpeption otherwise
                ]
              }
            ]
          }
      ]
    }
}
```
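Because formats accept standard .NET format strings, the specifiers used above behave just as they would in string.Format (output shown for an en-US culture):

```csharp
using System;
using System.Globalization;

var enUs = CultureInfo.GetCultureInfo("en-US");

// N4: culture-specific thousands separators and 4 (rounded) decimal places,
// as used for the OrderAmount field above.
Console.WriteLine(1234567.891.ToString("N4", enUs));  // 1,234,567.8910

// 00: zero-pad to two digits, as used for the Month and Day fields above.
Console.WriteLine(7.ToString("00", enUs));            // 07
```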

There are also multiple sample config files included in the project for reference use.

You can also refer to the XML documentation on the RecordConfiguration.cs and FieldConfiguration.cs classes for more detailed information.
