Job Description:

I built an ETL pipeline to process terabytes of data. To do so, I set up a Spark cluster (Scala) and a MinIO server for object storage.

Using 10 virtual machines for Spark processing, I can process and save 200 gigabytes in roughly 30 minutes.

The issue is that I cannot scale the processing: if I double the number of Spark virtual machines, the processing time does not change.

I need a Data Architect who has enough expertise to help me identify the bottleneck and fix the issue.


• I use virtual machines set up on-premises with VMware ESXi 6

• Physical machines (which host VMs) are on a 1 GB network.

• There is no overcommitment of vCPU or RAM

• Spark VMs: 16 vCPU, 64 GB RAM

• MinIO (storage): 16 vCPU, 64 GB RAM, configured with RAID 0
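
A quick consistency check on these figures may help narrow down where to look. The sketch below compares the observed throughput (200 GB in roughly 30 minutes) with the ceiling of a single network link. It rests on assumptions that the posting does not confirm: that "1 GB network" means 1 Gbit/s Ethernet, that gigabytes are decimal, and that all data passes through one such link to the MinIO server. The object and value names are illustrative only.

```scala
object ThroughputCheck {
  // Figures taken from the posting.
  val dataBytes: Double = 200e9        // 200 GB processed (decimal gigabytes assumed)
  val elapsedSeconds: Double = 30 * 60 // roughly 30 minutes

  // Assumption: "1 GB network" means a 1 Gbit/s Ethernet link.
  val linkBitsPerSecond: Double = 1e9

  /** Observed average throughput in megabytes per second. */
  def observedMBps: Double = dataBytes / elapsedSeconds / 1e6

  /** Theoretical link ceiling in megabytes per second, ignoring protocol overhead. */
  def linkMBps: Double = linkBitsPerSecond / 8 / 1e6

  /** Fraction of the link ceiling that the observed throughput represents. */
  def utilisation: Double = observedMBps / linkMBps

  def main(args: Array[String]): Unit = {
    println(f"observed:     $observedMBps%.1f MB/s")
    println(f"link ceiling: $linkMBps%.1f MB/s")
    println(f"utilisation:  ${utilisation * 100}%.0f%%")
  }
}
```

Under those assumptions the job averages about 111 MB/s against a ceiling of 125 MB/s, which would be consistent with a saturated link to storage rather than a lack of Spark compute. This is only a plausibility check; measuring actual network and disk utilisation on the MinIO host during a run would be needed to confirm or rule it out.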


The process is straightforward:

• Read data from two sources on MinIO

• Union the data from the two sources

• Filter out rows with empty values in one column of the resulting dataset

• Apply two groupBy operations on that column (we save intermediate values after the first groupBy)

• Union the dataset obtained after the groupBy operations with the rows that have empty values in that column

• Save the whole result back to MinIO

Skills: VMware, Spark, Data Engineer, Amazon S3, Big Data

About the Client:
(5 reviews) SAINT DENIS, France

Project ID: #35893478

5 freelancers are bidding an average of €334 for this job


Hi there, I am excited to share my expertise and skills in data engineering and big data, which I have acquired over the past 3 years. I am confident that I can meet your requirements. I would be delighted to work with ...

€140 EUR in 5 days
(1 Review)

Hi there, how are you? I have gone through your project details. I have extensive experience in VMware, Spark, Data Engineering, Big Data and Amazon S3. For that I would require from ...

€250 EUR in 8 days
(0 Reviews)

Hi Saint Denis, I am a Data Engineer with 7+ years of experience. I would like to offer my help in fixing this issue. Please let me know if we can connect.

€140 EUR in 7 days
(0 Reviews)

Hi, I have 10 years of experience in this. I would like to work for you, as I have already done similar tasks and supported many projects and people in the same way. I would like to hear from you. Thank you for ...

€140 EUR in 7 days
(1 Review)

Hi, I am a data engineer with 5 years of experience. I have designed and built large-scale Spark pipelines for use cases similar to yours. Unfortunately, as you might be aware, there is no straightforward answer to your pro...

€1000 EUR in 15 days
(0 Reviews)