This project creates a fully functional Apache Spark standalone cluster using Docker containers, specifically for running PySpark jobs. The infrastructure will be used in a machine-learning context.
A successful candidate should expect follow-up work as the infrastructure is adjusted.
If the work is based on existing Docker files on GitHub, that is OK, but please reference the source!
The infrastructure will eventually be deployed on Amazon Fargate, but Kubernetes on Docker would be a helpful addition to the submission. The immediate deployment target is Docker on macOS 11.2.3.
Since I am not a DevOps person, simplicity is preferred. I will need to be able to fully understand both the Dockerfile and the launch script (5.a and 5.b below).
PySpark jobs are intended to be launched through SSH on the master node and monitored using the Spark history server.
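As an illustration, the intended workflow might look like the sketch below. The port mappings (2222 for SSH, 18080 for the history server), the user name, and the job file name are assumptions for the example only, not requirements:

```shell
# Hypothetical workflow sketch; ports, user name, and job file are assumptions.
ssh -p 2222 spark@localhost            # SSH into the master container

# On the master node, launch a PySpark job against the standalone master:
spark-submit --master spark://master:7077 my_job.py

# Completed applications can then be monitored in the history server UI:
#   http://localhost:18080
```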
Below are the exact requirements for the container. Differentiation requirements that would be useful, but are not necessary for a qualifying entry, are in the attached [login to view URL] file.
Two reference files are provided: one Dockerfile (4.a) and a launch script (4.b). These files create an environment that DOES NOT meet the requirements, as it only works for local Spark execution (2.r fails to run) and no HDFS node is configured (failing 2.h). However, the files may serve as a reference for the format and extent of the files expected in section 5.
If you believe that any versions/packages mentioned below should be changed due to known incompatibility, stability issues, or the like, please mention that in a clarification request so that all participants can adjust their submissions accordingly.
CLARIFICATION: Note that "k) Spark history server is configured and started" is meant to run on the master and MUST NOT cause either the spark-submit or the history server to quit unexpectedly!
You CAN PROPOSE to configure the history server on a separate host using HDFS logging if you believe that is more suitable.
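For reference, an HDFS-backed history-server configuration could look roughly like the fragment below. The property names are standard Spark settings, but the host name, port, and log path are assumptions chosen to match the "data" directory in section 2:

```
# spark-defaults.conf sketch (host, port, and path are assumed values)
spark.eventLog.enabled            true
spark.eventLog.dir                hdfs://storage:9000/data/spark-logs
spark.history.fs.logDirectory     hdfs://storage:9000/data/spark-logs
```

With this layout, any host that can reach HDFS can serve the history UI, which is what makes the separate-host proposal workable.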
1) BID SUBMISSION REQUIREMENTS
a) A reference project creating a Spark standalone cluster using Docker (evidenced by screenshot or GitHub link)
b) Estimated project duration (elapsed time)
c) Project bid for MINIMUM REQUIREMENTS in section 2 (USD)
d) Additional estimated cost to complete the ADDITIONAL REQUIREMENTS in section 3 (USD)
e) Number of hours included to support setup
f) Hourly rate for additional work on this project (USD)
2) MINIMUM REQUIREMENTS FOR PROJECT
a) One compute unit consists of: 1 master node, 3 worker nodes, and 1 storage node
b) Only official Docker base images are to be used
c) All downloads and configurations to be performed transparently in the Dockerfile
d) Downloads only from official Debian, Apache, Oracle or Python locations
e) Python 3.6 or later is configured (3.8.8 is preferred)
f) At least Java 8 is used for JDK (OpenJDK 11 preferred)
g) Spark 3.1.1 pre-built for Hadoop 3.2 for the Spark environment
h) Hadoop 3.2 is configured for the storage nodes
i) Spark is configured in Standalone cluster mode
j) Spark master and workers are given sufficient resource configuration
k) Spark history server is configured and started
l) An HDFS volume is configured with a single "data" directory
m) Spark is configured to use HDFS volume "data"
n) The following Python packages are installed: numpy, pandas, scikit-learn, sklearn, pyspark, matplotlib
o) A Spark master is launched with three worker nodes registered
p) The master node's SSHD is running, and SSH connections to the master node are possible for launching jobs
q) The master node has a running Spark history server, and connections to it are possible
r) Successfully run "docker exec master spark-submit --master spark://master:7077 <spark_home>/examples/src/main/python/[login to view URL]"
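To make the topology of 2.a and 2.o concrete, a compose file for one compute unit might be shaped roughly as below. The service names, image names, commands, and port mappings are illustrative assumptions only; the actual submission must follow the Dockerfile and launch-script format expected in section 5:

```
# docker-compose sketch of one compute unit (illustrative only; image names,
# commands, and ports are assumptions): 1 master, 3 workers, 1 storage node.
version: "3"
services:
  master:
    image: my-spark:3.1.1          # hypothetical image built from the Dockerfile
    hostname: master
    ports:
      - "2222:22"                  # SSH for job submission (2.p)
      - "18080:18080"              # Spark history server (2.q)
  worker1:
    image: my-spark:3.1.1
    command: ["worker", "spark://master:7077"]
  worker2:
    image: my-spark:3.1.1
    command: ["worker", "spark://master:7077"]
  worker3:
    image: my-spark:3.1.1
    command: ["worker", "spark://master:7077"]
  storage:
    image: my-hadoop:3.2           # hypothetical HDFS storage-node image (2.h)
    hostname: storage
```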
3) ADDITIONAL REQUIREMENTS FOR DIFFERENTIATION: See [login to view URL]
4) REFERENCE INFORMATION: See [login to view URL]
5) PROJECT SUBMISSION COMPONENTS: See [login to view URL]