# .NET for Apache Spark Samples

This repo contains various example implementations of .NET for Apache Spark. The examples cover:
| Sample | Language |
| --- | --- |
| Azure Blob Storage | C#, F# |
| Azure Data Lake Storage Gen1 | C#, F# |
The following guide shows you how to get the samples up and running on your local machine.

## Prerequisites

To get started, you'll need the following installed on your machine:

- Apache Spark 2.4.1
- .NET Core 3.1 SDK
- JDK 8
- Microsoft.Spark.Worker 0.12.1
## Set Up Apache Spark

- Download Apache Spark 2.4.1.

- Extract the contents of the downloaded archive into the following directory:

  **Linux**

  ```
  ~/bin/spark-2.4.1-bin-hadoop2.7
  ```

  **Windows**

  ```
  C:\bin\spark-2.4.1-bin-hadoop2.7
  ```
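The extracted directory name follows Spark's fixed `spark-<version>-bin-hadoop<version>` naming pattern. A small sketch that pins the versions in variables, so the value you set for `SPARK_HOME` in the next step stays consistent with what you downloaded:

```shell
# Sketch: derive the extracted directory name from pinned versions so the
# SPARK_HOME value set in the next step matches the download exactly.
SPARK_VERSION="2.4.1"
HADOOP_VERSION="2.7"
SPARK_DIR="spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}"
echo "$HOME/bin/$SPARK_DIR"
```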
- Create `SPARK_HOME` and `HADOOP_HOME` environment variables and set their values to the Apache Spark directory.

  **Linux**

  ```bash
  export SPARK_HOME="$HOME/bin/spark-2.4.1-bin-hadoop2.7"
  export HADOOP_HOME="$HOME/bin/spark-2.4.1-bin-hadoop2.7"
  ```

  **Windows**

  ```cmd
  setx SPARK_HOME "C:\bin\spark-2.4.1-bin-hadoop2.7"
  setx HADOOP_HOME "C:\bin\spark-2.4.1-bin-hadoop2.7"
  ```
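The verification step runs `spark-shell` directly, which only works if `$SPARK_HOME/bin` is also on your `PATH` — worth checking if the command isn't found. A minimal sketch (the directory path is illustrative):

```shell
# Sketch: spark-shell is resolved from PATH, not from SPARK_HOME alone,
# so prepend $SPARK_HOME/bin to PATH. (Directory path is illustrative.)
export SPARK_HOME="$HOME/bin/spark-2.4.1-bin-hadoop2.7"
export PATH="$SPARK_HOME/bin:$PATH"
case ":$PATH:" in
  *":$SPARK_HOME/bin:"*) echo "spark bin on PATH" ;;
  *) echo "spark bin missing from PATH" ;;
esac
```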
- Verify the Apache Spark and Hadoop installation:

  ```bash
  spark-shell --version
  ```
## Set Up .NET Core

- Follow the instructions here: Install .NET Core on Linux or .NET Core 3.1 SDK for Windows.

- Verify the .NET Core SDK installation:

  ```bash
  dotnet --version
  ```
## Set Up the JDK

- Follow the instructions here:
  - Windows: OpenJDK: Download and install.
  - Linux: Java SE Development Kit 8.

- Verify the JDK installation:

  ```bash
  java -version
  ```
## Set Up Microsoft.Spark.Worker

- Download Microsoft.Spark.Worker 0.12.1.

- Extract the contents of the downloaded archive into the following directory:

  **Linux**

  ```
  ~/bin/Microsoft.Spark.Worker
  ```

  **Windows**

  ```
  C:\bin\Microsoft.Spark.Worker
  ```
- Create a `DOTNET_WORKER_DIR` environment variable and set its value to the Microsoft.Spark.Worker directory.

  **Linux**

  ```bash
  export DOTNET_WORKER_DIR="$HOME/bin/Microsoft.Spark.Worker"
  ```

  **Windows**

  ```cmd
  setx DOTNET_WORKER_DIR "C:\bin\Microsoft.Spark.Worker"
  ```
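Spark launches the worker from `DOTNET_WORKER_DIR` at runtime, so a bad path only surfaces once a job runs. A quick sanity-check sketch that catches the mistake early (the path is illustrative):

```shell
# Sketch: verify DOTNET_WORKER_DIR points at an extracted worker before
# submitting a job. (Path is illustrative.)
export DOTNET_WORKER_DIR="$HOME/bin/Microsoft.Spark.Worker"
if [ -d "$DOTNET_WORKER_DIR" ]; then
  echo "worker directory found"
else
  echo "worker directory missing: extract Microsoft.Spark.Worker first"
fi
```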
## Build the Solution

- Clone the repo:

  ```bash
  git clone https://github.com/usmanmohammed/dotnet-spark-samples.git
  ```

- Navigate to the solution directory:

  ```bash
  cd dotnet-spark-samples
  ```

- Restore and build the solution:

  ```bash
  dotnet build
  ```
## Run the Azure Blob Storage Sample

- Get your Azure Blob Storage access key. You can find it in the Azure Portal under the storage account's settings.

- Create environment variables for your Blob Storage account name (`AZURE_STORAGE_ACCOUNT`) and access key (`AZURE_STORAGE_KEY`).

  **Linux**

  ```bash
  export AZURE_STORAGE_ACCOUNT="<storage-account-name>"
  export AZURE_STORAGE_KEY="<storage-account-key>"
  ```

  **Windows**

  ```cmd
  setx AZURE_STORAGE_ACCOUNT "<storage-account-name>"
  setx AZURE_STORAGE_KEY "<storage-account-key>"
  ```
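Spark addresses Blob Storage through `wasbs://` URIs built from the account name. A sketch of how such a path is assembled — the container and file names here are hypothetical placeholders, not values taken from the sample:

```shell
# Sketch: a wasbs:// URI combines a container, the storage account name, and a
# blob path. Container and file names are hypothetical placeholders.
AZURE_STORAGE_ACCOUNT="mystorageaccount"
CONTAINER="mycontainer"
BLOB_PATH="wasbs://${CONTAINER}@${AZURE_STORAGE_ACCOUNT}.blob.core.windows.net/people.json"
echo "$BLOB_PATH"
```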
- Go to the build output directory.

  **Linux**

  ```bash
  cd src/Dotnet.Spark.Examples/Dotnet.Spark.CSharp.Examples.AzureStorage/bin/Debug/netcoreapp3.1
  ```

  **Windows**

  ```cmd
  cd .\src\Dotnet.Spark.Examples\Dotnet.Spark.CSharp.Examples.AzureStorage\bin\Debug\netcoreapp3.1
  ```

- Submit the application to run on Apache Spark.

  **Linux**

  ```bash
  spark-submit \
    --packages org.apache.hadoop:hadoop-azure:2.7.3,com.microsoft.azure:azure-storage:3.1.0 \
    --class org.apache.spark.deploy.dotnet.DotnetRunner \
    --master local \
    microsoft-spark-2.4.x-0.12.1.jar \
    ./Dotnet.Spark.CSharp.Examples.AzureStorage $AZURE_STORAGE_ACCOUNT $AZURE_STORAGE_KEY
  ```

  **Windows**

  ```cmd
  spark-submit ^
    --packages org.apache.hadoop:hadoop-azure:2.7.3,com.microsoft.azure:azure-storage:3.1.0 ^
    --class org.apache.spark.deploy.dotnet.DotnetRunner ^
    --master local ^
    microsoft-spark-2.4.x-0.12.1.jar ^
    Dotnet.Spark.CSharp.Examples.AzureStorage %AZURE_STORAGE_ACCOUNT% %AZURE_STORAGE_KEY%
  ```
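From Spark's point of view the Microsoft.Spark jar is the submitted application; everything after it — the .NET executable and its own arguments — is forwarded by `DotnetRunner` to the .NET process. A sketch of that positional layout (values are placeholders):

```shell
# Sketch: positional argument layout in the spark-submit command above.
# The jar is Spark's "application"; the .NET executable and its arguments
# follow it and are forwarded by DotnetRunner. Values are placeholders.
JAR="microsoft-spark-2.4.x-0.12.1.jar"                  # Spark's application
APP="./Dotnet.Spark.CSharp.Examples.AzureStorage"       # forwarded to DotnetRunner
echo "spark application jar: $JAR"
echo ".NET executable and args: $APP <account-name> <account-key>"
```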
## Run the Azure Data Lake Storage Gen1 Sample

- Create a service principal in Azure Active Directory (AAD).

- Grant the service principal Read/Write/Execute access to the Data Lake Storage account. You can configure this in the account's IAM settings.

- Retrieve the AAD tenant (directory) ID, the service principal's client (application) ID and client secret, and the Data Lake Storage account name.

- Create environment variables for your tenant ID (`TENANT_ID`), Data Lake name (`ADLS_NAME`), service principal client ID (`ADLS_SP_CLIENT_ID`), and client secret (`ADLS_SP_CLIENT_SECRET`).

  **Linux**

  ```bash
  export TENANT_ID="<aad-tenant-id>"
  export ADLS_NAME="<data-lake-gen1-name>"
  export ADLS_SP_CLIENT_ID="<service-principal-client-id>"
  export ADLS_SP_CLIENT_SECRET="<service-principal-client-key>"
  ```

  **Windows**

  ```cmd
  setx TENANT_ID "<aad-tenant-id>"
  setx ADLS_NAME "<data-lake-gen1-name>"
  setx ADLS_SP_CLIENT_ID "<service-principal-client-id>"
  setx ADLS_SP_CLIENT_SECRET "<service-principal-client-key>"
  ```
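Spark addresses Data Lake Storage Gen1 through `adl://` URIs built from the account name. A sketch of how such a path is assembled — the file path is a hypothetical placeholder, not a value taken from the sample:

```shell
# Sketch: an adl:// URI combines the Data Lake Gen1 account name and a file
# path. The file name is a hypothetical placeholder.
ADLS_NAME="mydatalake"
ADLS_PATH="adl://${ADLS_NAME}.azuredatalakestore.net/people.json"
echo "$ADLS_PATH"
```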
- Go to the build output directory.

  **Linux**

  ```bash
  cd src/Dotnet.Spark.Examples/Dotnet.Spark.CSharp.Examples.AzureDataLakeStorageGen1/bin/Debug/netcoreapp3.1
  ```

  **Windows**

  ```cmd
  cd .\src\Dotnet.Spark.Examples\Dotnet.Spark.CSharp.Examples.AzureDataLakeStorageGen1\bin\Debug\netcoreapp3.1
  ```

- Submit the application to run on Apache Spark.

  **Linux**

  ```bash
  spark-submit \
    --packages com.rentpath:hadoop-azure-datalake:2.7.3-0.1.0 \
    --class org.apache.spark.deploy.dotnet.DotnetRunner \
    --master local \
    microsoft-spark-2.4.x-0.12.1.jar \
    ./Dotnet.Spark.CSharp.Examples.AzureDataLakeStorageGen1 $TENANT_ID $ADLS_NAME $ADLS_SP_CLIENT_ID $ADLS_SP_CLIENT_SECRET
  ```

  **Windows**

  ```cmd
  spark-submit ^
    --packages com.rentpath:hadoop-azure-datalake:2.7.3-0.1.0 ^
    --class org.apache.spark.deploy.dotnet.DotnetRunner ^
    --master local ^
    microsoft-spark-2.4.x-0.12.1.jar ^
    Dotnet.Spark.CSharp.Examples.AzureDataLakeStorageGen1 %TENANT_ID% %ADLS_NAME% %ADLS_SP_CLIENT_ID% %ADLS_SP_CLIENT_SECRET%
  ```
## Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Feel free to open a PR.

## License

Distributed under the MIT License. See LICENSE for more information.