# Big Data in .NET using Hadoop MapReduce

### Core components
- Microsoft.Hadoop.MapReduce
- HDFS (Microsoft HDInsight Emulator)

### Application Description
The links below point to datasets containing NHS data on UK practices and the prescriptions they issued in a year.
- http://datagov.ic.nhs.uk/T201202ADD%20REXT.CSV
- http://datagov.ic.nhs.uk/T201109PDP%20IEXT.CSV

The solution should parse these datasets and answer the following questions:
- How many practices are in London?
- What was the average actual cost of all peppermint oil prescriptions?
- Which 5 post codes have the highest actual spend, and how much did each spend in total?
- For each region of England (North East, South West, London, etc.):
    - What was the average price per prescription of Flucloxacillin (excluding Co-Fluampicil)?
    - How much did this vary from the national mean?
- Come up with your own interesting question about NHS prescriptions in England and use the data (plus any other sources you'd like to use) to answer it.
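
As a rough illustration of how one of the questions above could be phrased as a map step and a reduce step, here is a minimal sketch for the peppermint oil average. It is not the code in this repository. It uses plain C# methods rather than the SDK's mapper and reducer base classes so that it runs without the NuGet package. The class name `PeppermintOilSketch`, the column positions (BNF name at index 4, actual cost at index 7), and the naive comma split are all assumptions made for illustration and should be checked against the real file headers.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical sketch: map/reduce logic for "average actual cost of peppermint oil".
public static class PeppermintOilSketch
{
    // ASSUMPTION: zero-based column positions in the prescription CSV.
    const int BnfNameColumn = 4;
    const int ActualCostColumn = 7;
    const string Key = "peppermint oil";

    // Map step: emit (Key, actual cost) for rows whose name mentions peppermint oil.
    public static IEnumerable<KeyValuePair<string, string>> Map(string inputLine)
    {
        // Naive split; assumes no quoted commas inside fields.
        string[] fields = inputLine.Split(',');
        if (fields.Length <= ActualCostColumn) yield break; // too short to be a data row

        string name = fields[BnfNameColumn].Trim();
        if (name.IndexOf(Key, StringComparison.OrdinalIgnoreCase) < 0) yield break;

        decimal cost;
        bool parsed = decimal.TryParse(fields[ActualCostColumn].Trim(),
            NumberStyles.Number, CultureInfo.InvariantCulture, out cost);
        if (!parsed) yield break; // non-numeric cost, for example a header row

        yield return new KeyValuePair<string, string>(
            Key, cost.ToString(CultureInfo.InvariantCulture));
    }

    // Reduce step: mean of all emitted costs for the key (0 if there are none).
    public static decimal Reduce(string key, IEnumerable<string> values)
    {
        decimal total = 0m;
        long count = 0;
        foreach (string value in values)
        {
            total += decimal.Parse(value, CultureInfo.InvariantCulture);
            count++;
        }
        return count == 0 ? 0m : total / count;
    }
}
```

With Microsoft.Hadoop.MapReduce, the same two bodies would presumably move into the `Map` and `Reduce` overrides of the mapper and reducer classes, emitting through the context object instead of returning values. "Average" is read here as the mean of the actual cost per row; weighting by the number of items would be a different, equally defensible reading.
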
I used the following tutorials to configure HDInsight and Hadoop and to write MapReduce functions in C#:
- https://azure.microsoft.com/en-gb/documentation/articles/hdinsight-hadoop-emulator-get-started/
- https://azure.microsoft.com/en-gb/documentation/articles/hdinsight-hadoop-develop-deploy-streaming-jobs/
- https://www.youtube.com/watch?v=uyi41nrhlhw&feature=youtu.be
- https://martin.atlassian.net/wiki/pages/viewpage.action?pageId=10354721
The following improvements to the existing code are still needed; future commits will cover them:
- Chain MapReduce for multiple jobs:
    - https://hadoop.apache.org/docs/r2.7.0/api/org/apache/hadoop/mapred/lib/ChainMapper.html
    - https://hadoop.apache.org/docs/r2.7.0/api/org/apache/hadoop/mapred/lib/ChainReducer.html
- Add an efficient sorting mechanism, an efficient way to combine/join the two datasets, and a Sort Comparator and Group Comparator.
- The fourth question needs an additional dataset that maps each postcode to its region, so that results grouped by postcode can be rolled up into region-based results.
- With the SDK, the MapperContext.InputFileName property is always empty, so a clean way to distinguish the datasets in the Mapper is still needed.
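
One possible workaround for the empty input file name, sketched below, is to classify each line by its shape and tag the emitted value with its source, which is the usual setup for a reduce-side join on practice code. This is only a sketch under stated assumptions, not a tested fix: the names `RecordSource` and `DatasetTagger` are hypothetical, and the column counts (8 for address rows, at least 10 for prescription rows) and column positions are assumptions about the two files that should be verified against their headers.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical: which dataset a line appears to come from.
public enum RecordSource { Unknown, Address, Prescription }

public static class DatasetTagger
{
    // ASSUMPTION: address rows have exactly 8 columns, prescription rows 10 or more.
    const int AddressColumnCount = 8;
    const int PrescriptionMinColumns = 10;

    // ASSUMPTION: zero-based column positions in each file.
    const int AddressPracticeColumn = 1;
    const int AddressPostcodeColumn = 7;
    const int PrescriptionPracticeColumn = 2;
    const int PrescriptionCostColumn = 7;

    // Guess the source dataset from the number of comma-separated fields.
    public static RecordSource Classify(string[] fields)
    {
        if (fields.Length == AddressColumnCount) return RecordSource.Address;
        if (fields.Length >= PrescriptionMinColumns) return RecordSource.Prescription;
        return RecordSource.Unknown;
    }

    // Emit (practice code, tagged value) so a reducer can tell the sources apart:
    // "A|<postcode>" for address rows, "P|<actual cost>" for prescription rows.
    // Returns null for lines that match neither shape.
    public static KeyValuePair<string, string>? TagForJoin(string inputLine)
    {
        // Naive split; assumes no quoted commas inside fields.
        string[] fields = inputLine.Split(',');
        switch (Classify(fields))
        {
            case RecordSource.Address:
                return new KeyValuePair<string, string>(
                    fields[AddressPracticeColumn].Trim(),
                    "A|" + fields[AddressPostcodeColumn].Trim());
            case RecordSource.Prescription:
                return new KeyValuePair<string, string>(
                    fields[PrescriptionPracticeColumn].Trim(),
                    "P|" + fields[PrescriptionCostColumn].Trim());
            default:
                return null;
        }
    }
}
```

A reducer receiving all values for one practice code could then read the postcode from the "A|" value and sum the "P|" values, skipping any that are not numeric (a header row would be tagged like a data row here). Classifying by column count is fragile if either file changes shape, so running a separate job per input folder may turn out to be the cleaner option.
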