# .NET for Apache Spark Samples

This repo contains various example implementations of .NET for Apache Spark. The examples cover:
| Sample | Language |
| --- | --- |
| Azure Blob Storage | C#, F# |
| Azure Data Lake Storage Gen1 | C#, F# |
The following guide shows you how to get the samples up and running on your local machine.

## Prerequisites

To get started, you'll need the following installed on your machine:

- Apache Spark 2.4.1
- .NET Core 3.1 SDK
- JDK 8
- Microsoft.Spark.Worker 0.12.1
## Set Up Apache Spark

- Download Apache Spark 2.4.1.

- Extract the contents of the downloaded archive into the following directory:

  **Linux**

  ```
  ~/bin/spark-2.4.1-bin-hadoop2.7
  ```

  **Windows**

  ```
  C:\bin\spark-2.4.1-bin-hadoop2.7
  ```
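The extracted directory name follows Spark's fixed `spark-<version>-bin-hadoop<version>` naming pattern. A small sketch that pins the versions in variables, so the value you set for `SPARK_HOME` in the next step stays consistent with what you downloaded:

```shell
# Sketch: derive the extracted directory name from pinned versions so the
# SPARK_HOME value set in the next step matches the download exactly.
SPARK_VERSION="2.4.1"
HADOOP_VERSION="2.7"
SPARK_DIR="spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}"
echo "$HOME/bin/$SPARK_DIR"
```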
- Create `SPARK_HOME` and `HADOOP_HOME` environment variables and set their values to the Apache Spark directory.

  **Linux**

  ```bash
  export SPARK_HOME="$HOME/bin/spark-2.4.1-bin-hadoop2.7"
  export HADOOP_HOME="$HOME/bin/spark-2.4.1-bin-hadoop2.7"
  ```

  **Windows**

  ```cmd
  setx SPARK_HOME "C:\bin\spark-2.4.1-bin-hadoop2.7"
  setx HADOOP_HOME "C:\bin\spark-2.4.1-bin-hadoop2.7"
  ```
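The verification step runs `spark-shell` directly, which only works if `$SPARK_HOME/bin` is also on your `PATH` — worth checking if the command isn't found. A minimal sketch (the directory path is illustrative):

```shell
# Sketch: spark-shell is resolved from PATH, not from SPARK_HOME alone,
# so prepend $SPARK_HOME/bin to PATH. (Directory path is illustrative.)
export SPARK_HOME="$HOME/bin/spark-2.4.1-bin-hadoop2.7"
export PATH="$SPARK_HOME/bin:$PATH"
case ":$PATH:" in
  *":$SPARK_HOME/bin:"*) echo "spark bin on PATH" ;;
  *) echo "spark bin missing from PATH" ;;
esac
```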
- Verify the Apache Spark and Hadoop installation:

  ```bash
  spark-shell --version
  ```
## Set Up .NET Core

- Follow the instructions here: Install .NET Core on Linux or .NET Core 3.1 SDK for Windows.

- Verify the .NET Core SDK installation:

  ```bash
  dotnet --version
  ```
## Set Up the JDK

- Follow the instructions here:
  - Windows: OpenJDK: Download and install.
  - Linux: Java SE Development Kit 8.

- Verify the JDK installation:

  ```bash
  java -version
  ```
## Set Up Microsoft.Spark.Worker

- Download Microsoft.Spark.Worker 0.12.1.

- Extract the contents of the downloaded archive into the following directory:

  **Linux**

  ```
  ~/bin/Microsoft.Spark.Worker
  ```

  **Windows**

  ```
  C:\bin\Microsoft.Spark.Worker
  ```
- Create a `DOTNET_WORKER_DIR` environment variable and set its value to the Microsoft.Spark.Worker directory.

  **Linux**

  ```bash
  export DOTNET_WORKER_DIR="$HOME/bin/Microsoft.Spark.Worker"
  ```

  **Windows**

  ```cmd
  setx DOTNET_WORKER_DIR "C:\bin\Microsoft.Spark.Worker"
  ```
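Spark launches the worker from `DOTNET_WORKER_DIR` at runtime, so a bad path only surfaces once a job runs. A quick sanity-check sketch that catches the mistake early (the path is illustrative):

```shell
# Sketch: verify DOTNET_WORKER_DIR points at an extracted worker before
# submitting a job. (Path is illustrative.)
export DOTNET_WORKER_DIR="$HOME/bin/Microsoft.Spark.Worker"
if [ -d "$DOTNET_WORKER_DIR" ]; then
  echo "worker directory found"
else
  echo "worker directory missing: extract Microsoft.Spark.Worker first"
fi
```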
## Build the Solution

- Clone the repo:

  ```bash
  git clone https://github.com/usmanmohammed/dotnet-spark-samples.git
  ```

- Navigate to the solution directory:

  ```bash
  cd dotnet-spark-samples
  ```

- Restore and build the solution:

  ```bash
  dotnet build
  ```
## Run the Azure Blob Storage Sample

- Get your Azure Blob Storage access key. You can find it in the Azure Portal under the storage account's settings.

- Create environment variables for your Blob Storage account name (`AZURE_STORAGE_ACCOUNT`) and access key (`AZURE_STORAGE_KEY`).

  **Linux**

  ```bash
  export AZURE_STORAGE_ACCOUNT="<storage-account-name>"
  export AZURE_STORAGE_KEY="<storage-account-key>"
  ```

  **Windows**

  ```cmd
  setx AZURE_STORAGE_ACCOUNT "<storage-account-name>"
  setx AZURE_STORAGE_KEY "<storage-account-key>"
  ```
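Spark addresses Blob Storage through `wasbs://` URIs built from the account name. A sketch of how such a path is assembled — the container and file names here are hypothetical placeholders, not values taken from the sample:

```shell
# Sketch: a wasbs:// URI combines a container, the storage account name, and a
# blob path. Container and file names are hypothetical placeholders.
AZURE_STORAGE_ACCOUNT="mystorageaccount"
CONTAINER="mycontainer"
BLOB_PATH="wasbs://${CONTAINER}@${AZURE_STORAGE_ACCOUNT}.blob.core.windows.net/people.json"
echo "$BLOB_PATH"
```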
- Go to the build output directory.

  **Linux**

  ```bash
  cd src/Dotnet.Spark.Examples/Dotnet.Spark.CSharp.Examples.AzureStorage/bin/Debug/netcoreapp3.1
  ```

  **Windows**

  ```cmd
  cd .\src\Dotnet.Spark.Examples\Dotnet.Spark.CSharp.Examples.AzureStorage\bin\Debug\netcoreapp3.1
  ```

- Submit the application to run on Apache Spark.

  **Linux**

  ```bash
  spark-submit \
    --packages org.apache.hadoop:hadoop-azure:2.7.3,com.microsoft.azure:azure-storage:3.1.0 \
    --class org.apache.spark.deploy.dotnet.DotnetRunner \
    --master local \
    microsoft-spark-2.4.x-0.12.1.jar \
    ./Dotnet.Spark.CSharp.Examples.AzureStorage $AZURE_STORAGE_ACCOUNT $AZURE_STORAGE_KEY
  ```

  **Windows**

  ```cmd
  spark-submit ^
    --packages org.apache.hadoop:hadoop-azure:2.7.3,com.microsoft.azure:azure-storage:3.1.0 ^
    --class org.apache.spark.deploy.dotnet.DotnetRunner ^
    --master local ^
    microsoft-spark-2.4.x-0.12.1.jar ^
    Dotnet.Spark.CSharp.Examples.AzureStorage %AZURE_STORAGE_ACCOUNT% %AZURE_STORAGE_KEY%
  ```
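From Spark's point of view the Microsoft.Spark jar is the submitted application; everything after it — the .NET executable and its own arguments — is forwarded by `DotnetRunner` to the .NET process. A sketch of that positional layout (values are placeholders):

```shell
# Sketch: positional argument layout in the spark-submit command above.
# The jar is Spark's "application"; the .NET executable and its arguments
# follow it and are forwarded by DotnetRunner. Values are placeholders.
JAR="microsoft-spark-2.4.x-0.12.1.jar"                  # Spark's application
APP="./Dotnet.Spark.CSharp.Examples.AzureStorage"       # forwarded to DotnetRunner
echo "spark application jar: $JAR"
echo ".NET executable and args: $APP <account-name> <account-key>"
```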
## Run the Azure Data Lake Storage Gen1 Sample

- Create a service principal in Azure Active Directory (AAD).

- Grant the service principal Read/Write/Execute access to the Data Lake Storage account. You can configure this in the account's IAM settings.

- Retrieve the AAD tenant (directory) ID, the service principal's client (application) ID and client secret, and the Data Lake Storage account name.

- Create environment variables for your tenant ID (`TENANT_ID`), Data Lake name (`ADLS_NAME`), service principal client ID (`ADLS_SP_CLIENT_ID`), and client secret (`ADLS_SP_CLIENT_SECRET`).

  **Linux**

  ```bash
  export TENANT_ID="<aad-tenant-id>"
  export ADLS_NAME="<data-lake-gen1-name>"
  export ADLS_SP_CLIENT_ID="<service-principal-client-id>"
  export ADLS_SP_CLIENT_SECRET="<service-principal-client-key>"
  ```

  **Windows**

  ```cmd
  setx TENANT_ID "<aad-tenant-id>"
  setx ADLS_NAME "<data-lake-gen1-name>"
  setx ADLS_SP_CLIENT_ID "<service-principal-client-id>"
  setx ADLS_SP_CLIENT_SECRET "<service-principal-client-key>"
  ```
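Spark addresses Data Lake Storage Gen1 through `adl://` URIs built from the account name. A sketch of how such a path is assembled — the file path is a hypothetical placeholder, not a value taken from the sample:

```shell
# Sketch: an adl:// URI combines the Data Lake Gen1 account name and a file
# path. The file name is a hypothetical placeholder.
ADLS_NAME="mydatalake"
ADLS_PATH="adl://${ADLS_NAME}.azuredatalakestore.net/people.json"
echo "$ADLS_PATH"
```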
- Go to the build output directory.

  **Linux**

  ```bash
  cd src/Dotnet.Spark.Examples/Dotnet.Spark.CSharp.Examples.AzureDataLakeStorageGen1/bin/Debug/netcoreapp3.1
  ```

  **Windows**

  ```cmd
  cd .\src\Dotnet.Spark.Examples\Dotnet.Spark.CSharp.Examples.AzureDataLakeStorageGen1\bin\Debug\netcoreapp3.1
  ```

- Submit the application to run on Apache Spark.

  **Linux**

  ```bash
  spark-submit \
    --packages com.rentpath:hadoop-azure-datalake:2.7.3-0.1.0 \
    --class org.apache.spark.deploy.dotnet.DotnetRunner \
    --master local \
    microsoft-spark-2.4.x-0.12.1.jar \
    ./Dotnet.Spark.CSharp.Examples.AzureDataLakeStorageGen1 $TENANT_ID $ADLS_NAME $ADLS_SP_CLIENT_ID $ADLS_SP_CLIENT_SECRET
  ```

  **Windows**

  ```cmd
  spark-submit ^
    --packages com.rentpath:hadoop-azure-datalake:2.7.3-0.1.0 ^
    --class org.apache.spark.deploy.dotnet.DotnetRunner ^
    --master local ^
    microsoft-spark-2.4.x-0.12.1.jar ^
    Dotnet.Spark.CSharp.Examples.AzureDataLakeStorageGen1 %TENANT_ID% %ADLS_NAME% %ADLS_SP_CLIENT_ID% %ADLS_SP_CLIENT_SECRET%
  ```
## Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Feel free to open a PR.

## License

Distributed under the MIT License. See LICENSE for more information.