CANNOT SCALE BIG DATA PROCESSING
Budget: €30-250 EUR
I built an ETL pipeline to process terabytes of data. To do that, I set up a Spark cluster (Scala) and a MinIO server for object storage.
Using 10 virtual machines for Spark processing, I can process and save 200 gigabytes in roughly 30 minutes.
The issue is that I cannot scale that processing: if I double the number of Spark virtual machines, the processing time does not change.
I need a Data Architect who has enough expertise to help me identify the bottleneck and fix the issue.
• The virtual machines run on-premises on VMware ESXi 6.
• The physical machines that host the VMs are on a 1 GB network.
• There is no overcommitment of vCPU or RAM.
• Spark VMs: 16 vCPU, 64 GB RAM.
• MinIO (storage): 16 vCPU, 64 GB RAM, configured with RAID 0.
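A quick back-of-the-envelope check may help frame the bottleneck question. The sketch below compares the throughput implied by the figures above (200 GB in about 30 minutes) with the ceiling of a single network link. It assumes that "1 GB network" means 1 Gbit/s Ethernet and that roughly 200 GB crosses the MinIO host's link in one direction. Neither is confirmed in the post, and `ThroughputCheck` is a hypothetical name used only for illustration.

```scala
object ThroughputCheck {
  // Figures taken from the post: 200 GB processed in about 30 minutes.
  val dataGB: Double = 200.0
  val minutes: Double = 30.0

  // ASSUMPTION: "1 GB network" means a 1 Gbit/s Ethernet link.
  val linkGbitPerSec: Double = 1.0

  // Observed end-to-end throughput in MB/s (decimal units: 1 GB = 1000 MB).
  def observedMBps: Double = dataGB * 1000.0 / (minutes * 60.0)

  // Theoretical ceiling of the link in MB/s (8 bits per byte),
  // ignoring protocol overhead.
  def linkMBps: Double = linkGbitPerSec * 1000.0 / 8.0

  // Share of the link ceiling that the observed throughput represents.
  def utilisation: Double = observedMBps / linkMBps

  def main(args: Array[String]): Unit = {
    println(f"observed throughput: $observedMBps%.1f MB/s")
    println(f"link ceiling:        $linkMBps%.1f MB/s")
    println(f"utilisation:         ${utilisation * 100}%.0f%%")
  }
}
```

Under those assumptions the job already moves about 111 MB/s against a 125 MB/s ceiling. If that holds, the network path to the single MinIO server would be worth measuring before adding more Spark VMs, since extra executors cannot pull data faster than the link allows. This is a hypothesis to test, not a diagnosis.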
SOME DETAILS ABOUT DATA PROCESSING
The process is straightforward:
• Read data from two sources on MinIO.
• Union the data from the two sources.
• Filter out rows with empty values in one column of the resulting dataset.
• Apply two groupBy operations on that column (we save the intermediate values after the first groupBy).
• Union the dataset obtained after the groupBy operations with the rows that had empty values in that column.
• Save the whole result back to MinIO.
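The steps above could be sketched in Spark (Scala) roughly as follows. This is a minimal illustration, not the actual job. The column names (`key`, `sub_key`), the aggregations (`count`, `sum`), the Parquet format, and the `s3a://example-bucket/...` paths are all hypothetical, since the post does not specify them. It also assumes Spark 3.1 or later for `unionByName(..., allowMissingColumns = true)`.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count, sum}

object EtlSketch {
  // ASSUMPTION: column names are hypothetical; the post does not name them.
  val Key = "key"
  val Sub = "sub_key"

  /** Returns (intermediate result after the first groupBy, final dataset). */
  def transform(sourceA: DataFrame, sourceB: DataFrame): (DataFrame, DataFrame) = {
    // Union of the two sources, matching columns by name.
    val all = sourceA.unionByName(sourceB)

    // Split rows on whether the key column is empty
    // (treated here as null or blank string, which is an assumption).
    val isEmpty = col(Key).isNull || col(Key) === ""
    val emptyRows = all.filter(isEmpty)
    val nonEmpty = all.filter(!isEmpty)

    // First groupBy: hypothetical count per (key, sub_key).
    val first = nonEmpty.groupBy(col(Key), col(Sub)).agg(count("*").as("n"))

    // Second groupBy: hypothetical roll-up per key.
    val second = first.groupBy(col(Key)).agg(sum("n").as("total"))

    // Union the aggregated rows with the rows whose key was empty;
    // columns missing on either side are filled with nulls.
    val result = second.unionByName(emptyRows, allowMissingColumns = true)
    (first, result)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("etl-sketch").getOrCreate()

    // Hypothetical MinIO locations, accessed through the s3a connector.
    val a = spark.read.parquet("s3a://example-bucket/source-a/")
    val b = spark.read.parquet("s3a://example-bucket/source-b/")

    val (first, result) = transform(a, b)

    // Save the intermediate values after the first groupBy, then the final result.
    first.write.mode("overwrite").parquet("s3a://example-bucket/intermediate/")
    result.write.mode("overwrite").parquet("s3a://example-bucket/output/")

    spark.stop()
  }
}
```

In a shape like this, `first` is both written out and reused by the second groupBy, so Spark recomputes it from the sources unless it is persisted or read back from the intermediate location. Both groupBy steps also trigger a shuffle, so executor-to-executor traffic competes with MinIO traffic on the same network.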
5 freelancers are bidding an average of €334 for this job
Hi there, I am excited to share my expertise and skills in data engineering and big data, which I have acquired over the past 3 years. I am confident that I can meet your requirements. I would be delighted to work with…
Hi there, how are you? I have gone through your project details. I would like to tell you that I have a great deal of experience in VMware, Spark, data engineering, big data, and Amazon S3. For that I would require from…
Hi Saint Denis, I am a Data Engineer with 7+ years of experience. I would like to offer you help to fix this issue. Please let me know if we can connect.
Hi, I have 10 years of experience in this. I would like to work for you, as I have already done similar tasks and supported many projects and people in the same way. I would like to hear from you. Thank you for
Hi, I am a data engineer with 5 years of experience. I have designed and built large-scale Spark pipelines for use cases similar to yours. Unfortunately, as you might be aware, there is no straightforward answer to your pro…