
.NET for Apache Spark Samples

Example implementations of .NET for Apache Spark.


In this repo, we have various example implementations of .NET for Apache Spark. These examples cover:

Sample                          Language
Azure Blob Storage              C#, F#
Azure Data Lake Storage Gen1    C#, F#

Getting Started

The following guide will show you how to get samples up and running on your local machine.

Prerequisites

To get started, you'll need the following installed on your machine.

  1. Apache Spark 2.4.1
  2. .NET Core 3.1 SDK
  3. JDK 8
  4. Microsoft.Spark.Worker 0.12.1

Install Prerequisites

Apache Spark 2.4.1

  1. Download Apache Spark 2.4.1.

  2. Extract contents of the downloaded Apache Spark archive into the following directory.

    Linux

    ~/bin/spark-2.4.1-bin-hadoop2.7

    Windows

    C:\bin\spark-2.4.1-bin-hadoop2.7
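  3. Create SPARK_HOME and HADOOP_HOME environment variables and set their values to the Apache Spark directory. On Windows, setx only affects Command Prompt windows opened after it runs, so open a new one before continuing.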

    Linux

    export SPARK_HOME="$HOME/bin/spark-2.4.1-bin-hadoop2.7"
    export HADOOP_HOME="$HOME/bin/spark-2.4.1-bin-hadoop2.7"

    Windows

    setx SPARK_HOME "C:\bin\spark-2.4.1-bin-hadoop2.7"
    setx HADOOP_HOME "C:\bin\spark-2.4.1-bin-hadoop2.7"
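  4. Verify Apache Spark and Hadoop installation. This requires the Spark bin directory ($SPARK_HOME/bin on Linux, %SPARK_HOME%\bin on Windows) to be on your PATH.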

    spark-shell --version
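
If spark-shell fails to start, a mistyped or unexpanded path in one of the two variables is a common cause. The sketch below is a small POSIX shell check that each variable is set and points at an existing directory. The helper name check_dir_var is hypothetical and is not part of Spark or this repo.

```shell
# check_dir_var NAME: succeed only if the environment variable NAME is set
# and its value is an existing directory. (Hypothetical helper name.)
check_dir_var() {
    name=$1
    # Indirect lookup: read the value of the variable whose name is in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
        echo "$name is not set"
        return 1
    fi
    if [ ! -d "$value" ]; then
        echo "$name points at a missing directory: $value"
        return 1
    fi
    echo "$name ok: $value"
}

check_dir_var SPARK_HOME || echo "fix SPARK_HOME before continuing"
check_dir_var HADOOP_HOME || echo "fix HADOOP_HOME before continuing"
```

A value that still contains a literal ~ (because it was written inside double quotes) fails the directory test, which makes that mistake easy to spot.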

.NET Core 3.1 SDK

  1. Follow the instructions here: Install .NET Core on Linux or .NET Core 3.1 SDK for Windows.

  2. Verify .NET Core SDK installation.

    dotnet --version

JDK 8

  1. Install JDK 8 for your platform.

  2. Verify JDK installation.

    java -version
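
The three verification commands above (spark-shell, dotnet, java) only work when each tool is on PATH. As a rough sketch, the loop below reports which of them the shell can resolve. The function name report_tool is hypothetical, and the check says nothing about the installed versions.

```shell
# report_tool NAME: print whether NAME resolves to a command on PATH.
# (Hypothetical helper; it does not check the installed version.)
report_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
        return 1
    fi
}

for tool in spark-shell dotnet java; do
    report_tool "$tool" || echo "install $tool or add it to PATH"
done
```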

Microsoft.Spark.Worker 0.12.1

  1. Download Microsoft.Spark.Worker 0.12.1.

  2. Extract contents of the downloaded archive into the following directory.

    Linux

    ~/bin/Microsoft.Spark.Worker

    Windows

    C:\bin\Microsoft.Spark.Worker
  3. Create DOTNET_WORKER_DIR environment variable and set its value to Microsoft.Spark.Worker directory.

    Linux

    export DOTNET_WORKER_DIR="$HOME/bin/Microsoft.Spark.Worker"

    Windows

    setx DOTNET_WORKER_DIR "C:\bin\Microsoft.Spark.Worker"
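
To confirm that DOTNET_WORKER_DIR points at the extracted worker, a check along these lines may help on Linux. It assumes the worker binary is named Microsoft.Spark.Worker and sits directly inside that directory. If the archive extracted into a versioned subfolder, the variable needs to point at that subfolder instead. The function name worker_status is hypothetical.

```shell
# worker_status DIR: report whether DIR holds an executable worker binary.
# Assumes the binary is named Microsoft.Spark.Worker (no extension on Linux).
worker_status() {
    bin="$1/Microsoft.Spark.Worker"
    if [ ! -f "$bin" ]; then
        echo "not found: $bin"
        return 1
    fi
    if [ ! -x "$bin" ]; then
        echo "not executable: $bin (try: chmod +x \"$bin\")"
        return 1
    fi
    echo "worker ok: $bin"
}

worker_status "${DOTNET_WORKER_DIR:-}" || echo "check DOTNET_WORKER_DIR"
```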

Build Samples

  1. Clone the repo.

    git clone https://github.com/usmanmohammed/dotnet-spark-samples.git
  2. Navigate to the solution directory.

    cd dotnet-spark-samples
  3. Restore and build the solution.

    dotnet build
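
The spark-submit commands under Run Samples pass microsoft-spark-2.4.x-0.12.1.jar by relative path, so that jar has to be present in the build output directory. A minimal sketch for checking this after the build, run from the solution directory, is shown below. The function name find_spark_jar is hypothetical, and the path is the C# Azure Blob Storage sample's output directory as given in this README.

```shell
# find_spark_jar DIR: print the first microsoft-spark jar for Spark 2.4.x in DIR.
# (Hypothetical helper name.)
find_spark_jar() {
    for jar in "$1"/microsoft-spark-2.4.x-*.jar; do
        # An unmatched glob stays literal, so test that the file really exists.
        if [ -f "$jar" ]; then
            echo "$jar"
            return 0
        fi
    done
    echo "no microsoft-spark-2.4.x jar in $1" >&2
    return 1
}

find_spark_jar src/Dotnet.Spark.Examples/Dotnet.Spark.CSharp.Examples.AzureStorage/bin/Debug/netcoreapp3.1 \
    || echo "run dotnet build first"
```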

Run Samples

Azure Blob Storage

  1. Get your Azure Blob Storage Access Key. This can be accessed from the Azure Portal.

  2. Create environment variables for your Blob Storage Account Name (AZURE_STORAGE_ACCOUNT) and Access Key (AZURE_STORAGE_KEY).

    Linux

    export AZURE_STORAGE_ACCOUNT="<storage-account-name>"
    export AZURE_STORAGE_KEY="<storage-account-key>"

    Windows

    setx AZURE_STORAGE_ACCOUNT "<storage-account-name>"
    setx AZURE_STORAGE_KEY "<storage-account-key>"
  3. Go to build output directory.

    Linux

    cd src/Dotnet.Spark.Examples/Dotnet.Spark.CSharp.Examples.AzureStorage/bin/Debug/netcoreapp3.1

    Windows

    cd .\src\Dotnet.Spark.Examples\Dotnet.Spark.CSharp.Examples.AzureStorage\bin\Debug\netcoreapp3.1
  4. Submit application to run on Apache Spark.

    Linux

    spark-submit \
    --packages org.apache.hadoop:hadoop-azure:2.7.3,com.microsoft.azure:azure-storage:3.1.0 \
    --class org.apache.spark.deploy.dotnet.DotnetRunner \
    --master local microsoft-spark-2.4.x-0.12.1.jar \
    ./Dotnet.Spark.CSharp.Examples.AzureStorage $AZURE_STORAGE_ACCOUNT $AZURE_STORAGE_KEY

    Windows

    spark-submit ^
    --packages org.apache.hadoop:hadoop-azure:2.7.3,com.microsoft.azure:azure-storage:3.1.0 ^
    --class org.apache.spark.deploy.dotnet.DotnetRunner ^
    --master local microsoft-spark-2.4.x-0.12.1.jar ^
    Dotnet.Spark.CSharp.Examples.AzureStorage %AZURE_STORAGE_ACCOUNT% %AZURE_STORAGE_KEY%
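
The Linux command above can be wrapped in a small function that refuses to run when either variable is empty and that offers a dry run for inspecting the arguments. This is only a sketch: the function name submit_blob_sample and the --dry-run flag are hypothetical, and the dry run hides the access key on purpose.

```shell
# submit_blob_sample [--dry-run]: run the Linux spark-submit command above.
# With --dry-run it prints the command, with the key masked, instead of running it.
submit_blob_sample() {
    mode=${1:-run}
    if [ -z "${AZURE_STORAGE_ACCOUNT:-}" ] || [ -z "${AZURE_STORAGE_KEY:-}" ]; then
        echo "AZURE_STORAGE_ACCOUNT and AZURE_STORAGE_KEY must be set" >&2
        return 1
    fi
    # Build the command as positional parameters so no line continuations are needed later.
    set -- spark-submit \
        --packages org.apache.hadoop:hadoop-azure:2.7.3,com.microsoft.azure:azure-storage:3.1.0 \
        --class org.apache.spark.deploy.dotnet.DotnetRunner \
        --master local microsoft-spark-2.4.x-0.12.1.jar \
        ./Dotnet.Spark.CSharp.Examples.AzureStorage "$AZURE_STORAGE_ACCOUNT"
    if [ "$mode" = "--dry-run" ]; then
        echo "$* <key hidden>"
    else
        "$@" "$AZURE_STORAGE_KEY"
    fi
}

submit_blob_sample --dry-run || echo "export the two variables first"
```

Calling submit_blob_sample without arguments from the build output directory runs the real submission.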

Azure Data Lake Storage Gen1

  1. Create a Service Principal on Azure Active Directory (AAD).

  2. Grant Service Principal Read/Write/Execute access to the Data Lake Storage account. This can be configured in the account's IAM Settings.

  3. Retrieve the AAD Tenant (Directory) ID, Service Principal Client (Application) ID, Client Secret, and Data Lake Storage name.

  4. Create environment variables for your Tenant ID (TENANT_ID), Data Lake name (ADLS_NAME), Service Principal Client ID (ADLS_SP_CLIENT_ID) and Client Secret (ADLS_SP_CLIENT_SECRET).

    Linux

    export TENANT_ID="<aad-tenant-id>"
    export ADLS_NAME="<data-lake-gen1-name>"
    export ADLS_SP_CLIENT_ID="<service-principal-client-id>"
    export ADLS_SP_CLIENT_SECRET="<service-principal-client-key>"

    Windows

    setx TENANT_ID "<aad-tenant-id>"
    setx ADLS_NAME "<data-lake-gen1-name>"
    setx ADLS_SP_CLIENT_ID "<service-principal-client-id>"
    setx ADLS_SP_CLIENT_SECRET "<service-principal-client-key>"
  5. Go to build output directory.

    Linux

    cd src/Dotnet.Spark.Examples/Dotnet.Spark.CSharp.Examples.AzureDataLakeStorageGen1/bin/Debug/netcoreapp3.1

    Windows

    cd .\src\Dotnet.Spark.Examples\Dotnet.Spark.CSharp.Examples.AzureDataLakeStorageGen1\bin\Debug\netcoreapp3.1
  6. Submit application to run on Apache Spark.

    Linux

    spark-submit \
    --packages com.rentpath:hadoop-azure-datalake:2.7.3-0.1.0 \
    --class org.apache.spark.deploy.dotnet.DotnetRunner \
    --master local microsoft-spark-2.4.x-0.12.1.jar \
    ./Dotnet.Spark.CSharp.Examples.AzureDataLakeStorageGen1 $TENANT_ID $ADLS_NAME $ADLS_SP_CLIENT_ID $ADLS_SP_CLIENT_SECRET

    Windows

    spark-submit ^
    --packages com.rentpath:hadoop-azure-datalake:2.7.3-0.1.0 ^
    --class org.apache.spark.deploy.dotnet.DotnetRunner ^
    --master local microsoft-spark-2.4.x-0.12.1.jar ^
    Dotnet.Spark.CSharp.Examples.AzureDataLakeStorageGen1 %TENANT_ID% %ADLS_NAME% %ADLS_SP_CLIENT_ID% %ADLS_SP_CLIENT_SECRET%
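
The Data Lake sample takes four positional arguments, so an empty variable shifts the remaining values into the wrong positions. A pre-flight check along the following lines lists the names of any variables that are unset or empty without ever printing their values. The function name missing_vars is hypothetical.

```shell
# missing_vars NAME...: print the names (never the values) of variables that
# are unset or empty; succeed only if none are missing. (Hypothetical helper.)
missing_vars() {
    missing=""
    for name in "$@"; do
        # Indirect lookup of the variable named by $name.
        eval "value=\${$name:-}"
        [ -n "$value" ] || missing="$missing $name"
    done
    if [ -n "$missing" ]; then
        echo "missing:$missing"
        return 1
    fi
}

missing_vars TENANT_ID ADLS_NAME ADLS_SP_CLIENT_ID ADLS_SP_CLIENT_SECRET \
    || echo "set the variables listed above, then retry"
```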

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Feel free to open a PR.

License

Distributed under the MIT License. See LICENSE for more information.
