
Open • Ends in 2 days
I’m moving our AI inference off OpenAI and onto a Tesla P100 16 GB box that already runs qwen2.5 7B/14B on Ollama. The backend is wired to switch between local and remote per-prompt, so the infrastructure work is minimal; the real task is model selection, tuning and validation.

Cost reduction is the driving motive, but I will only flip a prompt when the dashboards show zero drop in the accuracy of CV match scores (real interview rate is the secondary check). We have four production prompts; each will run in shadow mode until its metrics are indistinguishable from the current OpenAI baseline.

What I need from you
• Pick or fine-tune the best qwen2.5 variant (or another model that will fit and perform on a single P100), then set quantisation, context window and batching so latency stays reasonable.
• Run shadow tests, analyse the match-score deltas and recommend when we should switch traffic.
• Repeat for all four prompts, one at a time, over roughly 6–10 weeks at 15–25 hours per week.

To apply, tell me:
1. Your hourly rate.
2. One similar migration or on-prem LLM project you’ve shipped.
3. The first change you’d make given a P100 and qwen2.5.

If you thrive on squeezing the most out of limited GPU memory while keeping quality rock solid, I’d love to hear your plan.
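For bidders, here is a minimal sketch of how the "indistinguishable from the OpenAI baseline" check in the brief could be made concrete. It assumes each shadow-mode CV yields one match score from OpenAI and one from the local model, and it treats a prompt as ready when a bootstrap confidence interval for the mean paired delta sits inside a tolerance band. The function names, the bootstrap approach and the `tolerance` value are illustrative assumptions, not something the brief specifies.

```python
import random
import statistics
from typing import Sequence


def paired_deltas(baseline: Sequence[float], candidate: Sequence[float]) -> list[float]:
    """Per-CV score difference (local minus OpenAI); inputs must be paired."""
    if len(baseline) != len(candidate):
        raise ValueError("score lists must be paired, one entry per CV")
    return [c - b for b, c in zip(baseline, candidate)]


def bootstrap_ci(deltas: Sequence[float], n_boot: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean delta."""
    rng = random.Random(seed)  # fixed seed so reruns give the same interval
    n = len(deltas)
    means = sorted(
        statistics.fmean(rng.choices(deltas, k=n)) for _ in range(n_boot)
    )
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def within_tolerance(ci: tuple[float, float], tolerance: float) -> bool:
    """True when the whole interval lies inside [-tolerance, +tolerance]."""
    lower, upper = ci
    return -tolerance <= lower and upper <= tolerance


if __name__ == "__main__":
    # Hypothetical scores for illustration only.
    openai_scores = [72.0, 65.0, 88.0, 41.0, 90.0, 55.0]
    local_scores = [71.0, 66.0, 88.0, 40.0, 91.0, 55.0]
    ci = bootstrap_ci(paired_deltas(openai_scores, local_scores))
    print("95% CI for mean delta:", ci)
    print("ready to switch:", within_tolerance(ci, tolerance=1.0))
```

A mean-delta check alone can hide offsetting errors, so a real run would probably also look at the distribution of absolute deltas and at the secondary interview-rate metric.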
Project ID: 40486338
96 proposals
Open for bidding
Remote project
Active 2 days ago
96 freelancers are bidding on average £16 GBP/hour for this job

Hello, I trust you're doing well. I have nearly a decade of hands-on experience with machine learning algorithms. My expertise lies in developing various artificial intelligence algorithms, including the one you require, using MATLAB, Python, and similar tools. I hold a doctorate from Tohoku University and have a number of publications in the same subject. My portfolio, which showcases my past work, is available for your review. Your project piqued my interest, and I would be delighted to be part of it. Let's connect to discuss in detail. Warm regards. Please check my portfolio link: https://www.freelancer.com/u/sajjadtaghvaeifr
£15 GBP in 40 days
7.2

Hi there, I understand you’re migrating your AI inference from OpenAI to a local Tesla P100 16GB environment running Qwen2.5 models via Ollama, with a focus on reducing cost while maintaining parity in CV match-score performance across four production prompts in shadow mode before full cutover. My approach will be to structure this as a controlled model optimisation and validation pipeline using your existing routing layer (local vs remote switching). I will first benchmark the current OpenAI baseline outputs, then select the most suitable Qwen2.5 variant (or alternative lightweight model) for the P100 constraints, followed by careful quantisation tuning (likely 4-bit/8-bit depending on quality vs latency trade-offs), context window optimisation, and batching configuration to stabilise throughput without degrading response fidelity. From there, I will implement a shadow evaluation framework where each of the four prompts runs in parallel against OpenAI outputs. Using your CV match-score metrics as the primary signal, I will analyse distribution drift, error cases, and semantic divergence. Once parity thresholds are consistently met, I will recommend phased traffic switching per prompt, ensuring no regression in downstream “real interview rate” validation. Before starting, are your current match-score evaluations already logged per prompt, or will we need to instrument additional tracking in the inference layer? I’m ready to begin immediately. Warm Regards, Aneesa.
£10 GBP in 40 days
6.1

Hello. I do many similar tasks (under NDA, sorry) on even older and weaker home GPUs, doing two things in parallel: 1) very specific prompt optimization/translation (my know-how, sorry); 2) selection and light tweaking of an LLM so that it runs steadily. In your case I already have good pre-selected candidates that can run on 16 GB efficiently. Thus, it's up to you. I can start immediately. Regards,
£16 GBP in 40 days
5.7

As a highly skilled and experienced full-stack developer, I am confident in my ability to manage and optimize your LLM migration project. My expertise in Java, C++, Python, and machine learning makes me well-suited to assess, fine-tune, and select the ideal qwen2.5 variant or alternative model that will perform well on your Tesla P100 16 GB box. Furthermore, my fluency in HW/SW AI and my ability to extract, process, and analyze data are the skills needed to run efficient shadow tests and identify any match-score deltas to inform prompt switches. As for previous projects of a similar nature, I recall a successful migration in which I moved an inference engine from OpenAI's infrastructure to a local hosting facility with constant monitoring, while assessing costs and maintaining performance. That project parallels yours: cost reduction was the main goal, with no compromise on quality or performance.
£10 GBP in 40 days
5.6

Hello, I understand your operational goal: to migrate four key prompts from OpenAI to a self-hosted LLM on a P100 box to cut costs. The process is a careful A/B test, running a local model like qwen2.5 in shadow mode and only making the switch when its CV match scores are proven to be indistinguishable from the live OpenAI baseline. My proposed rate is £15/hr. Technical approach: My first step would be to benchmark the qwen2.5-14B model with 4-bit quantization. This immediately tells us if the model is viable on the P100 in terms of latency and VRAM before we commit to it. I will then systematically evaluate other candidates (e.g., Mistral 7B) and quantization levels (Q4_K_M, AWQ) to find the best fit. I'll establish a robust logging pipeline to capture outputs from both models and automate the delta analysis of your match scores to guide the switch-over decision for each prompt. Relevant systems: We recently integrated a custom LLM on resource-constrained hardware for our Alpha Robot project, which required similar performance and memory optimization. This structured process ensures we meet your zero-accuracy-drop requirement for each prompt migration. I have a few questions for you on the clarification board. Regards, Rohit
£10 GBP in 60 days
4.6
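The bid above proposes benchmarking qwen2.5-14B at 4-bit first to see whether it is viable on the P100. Before running anything, a rough back-of-envelope VRAM estimate can be sketched as below. Every model figure in the example (about 14.7 billion parameters, roughly 4.85 bits per weight for a Q4_K_M-style quantisation, 48 layers, 8 KV heads, head dimension 128, fp16 KV cache) is an assumption to verify against the model card and the actual GGUF file, and the estimate ignores activation buffers and CUDA context overhead.

```python
GIB = 1024 ** 3


def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantised weights in bytes."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """KV cache for one sequence: keys and values for every layer and position."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


def fits(total_bytes: float, vram_gib: float, headroom_gib: float = 1.5) -> bool:
    """True when the estimate leaves the requested headroom free."""
    return total_bytes <= (vram_gib - headroom_gib) * GIB


if __name__ == "__main__":
    # Assumed figures for a 14B-class model; check them before relying on this.
    weights = weight_bytes(n_params=14.7e9, bits_per_weight=4.85)
    cache = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128, context_len=4096)
    total = weights + cache
    print(f"weights  ~ {weights / GIB:.2f} GiB")
    print(f"kv cache ~ {cache / GIB:.2f} GiB per sequence")
    print("fits on a 16 GiB card with headroom:", fits(total, vram_gib=16))
```

The KV cache term scales linearly with context length and with the number of concurrent sequences, which is why context window and batching have to be tuned together on a 16 GB card.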

Hi, I am an AI / LLM Engineer with 8 years of rich experience. I am familiar with Ollama, Qwen2.5, Python, Machine Learning, and AI Model Integration. For this project, the most important part is maintaining the same CV match accuracy while reducing inference costs. I can evaluate and optimize Qwen2.5 models, tune quantization and inference settings, and run shadow testing against the current OpenAI baseline before recommending traffic migration. I will focus on achieving stable quality with efficient GPU usage on the P100. I'm an individual freelancer and can work in any time zone you want. Please contact me with the best time for you to have a quick chat. Looking forward to discussing more details. Thanks. Emile.
£15 GBP in 40 days
3.4

Hi, Krishna here. We are a team of 20+ engineers and have completed 300+ projects with a 4.7 rating. We have recently done a similar project and would like to chat and discuss. As an AI specialist with a broad range of skills, my experience aligns perfectly with the unique challenge you've outlined. Our team has migrated several large-scale natural language processing (NLP) projects and successfully deployed them on-premises, similar to your requirement. One such project involved migrating predictive analytics, NLP, and computer vision frameworks for a major e-commerce platform to improve demand forecasting, streamline language understanding, and automate quality inspection respectively. Given a Tesla P100 GPU and the qwen2.5 model, the first change I would propose is fine-tuning the model parameters to maximize efficiency without sacrificing quality. The key here is optimizing quantisation (model size), context window (memory usage), and batching (pipeline throughput) while ensuring reasonable latency.
£13 GBP in 40 days
3.6

Hi, The first thing that stood out to me is that this isn't really a model migration project—it's an evaluation project. Moving prompts to a local model is easy. Proving that match-score quality remains unchanged is the part that determines whether the migration succeeds or fails. One question I had while reading the scope: are the current CV match scores generated directly by prompts, or is there an additional scoring layer sitting on top of the model outputs? Given a single P100, the first thing I'd investigate is whether the 14B model is actually providing measurable gains on your evaluation set versus an optimized 7B setup. In many cases, latency and throughput penalties outweigh the quality improvement, especially when the success metric is score consistency rather than general reasoning ability. I have worked on local inference workflows, model evaluation pipelines, and LLM integrations where maintaining output quality was more important than simply replacing an API. I'd be interested in learning more about how you're currently measuring score drift and promotion criteria during shadow testing. Looking forward to hearing more.
£10 GBP in 20 days
3.2

Hi there!

Optimizing on-prem LLM performance on P100 while matching OpenAI baseline accuracy

Project understanding: I understand you are migrating inference from OpenAI to a local Tesla P100 setup using Ollama with Qwen2.5 models. The goal is to reduce cost while maintaining zero drop in CV match-score quality across 4 production prompts using shadow testing before switching traffic.

• Model selection & fine-tuning for Qwen2.5 (or alternative) optimized for P100 constraints
• Quantization, context window tuning, and batching optimization for stable latency
• Shadow testing pipeline to compare OpenAI vs local outputs with scoring analysis
• Prompt-by-prompt migration strategy with validation before production cutover
• Performance monitoring dashboards for match-score delta tracking
• GPU memory optimization and inference speed tuning on single P100 setup
• Risk-free rollout strategy with fallback routing logic

I have experience working on LLM optimization, model deployment on limited GPU environments, and inference pipelines using Python, HuggingFace, and Ollama-based setups. My approach would start with profiling your current prompts, then benchmarking Qwen2.5 variants on your exact workload, followed by structured shadow testing with automated scoring comparison to ensure no regression before switching traffic.

Warm Regards, Farhin B.
£11 GBP in 40 days
3.8

Hi, I've worked on integrating various AI models into existing workflows, optimizing them for performance and cost-effectiveness. For example, I successfully migrated a CRM system to use a more efficient model, reducing costs without compromising on accuracy. Given your project, I’d start by fine-tuning the qwen2.5 variant to fit your P100 constraints, ensuring minimal latency and optimal performance. Let's begin with a small test to align on specific requirements before diving into the full project. Best Regards, Ivica
£13 GBP in 40 days
2.7

With over 8 years of experience in data analytics and data science, I understand the significance of what you need and how precise data-driven decisions can affect the bottom line. My specialty in 'data storytelling', 'predictive analytics' and 'end-to-end data solutions' makes me an ideal fit for your self-hosted LLM prompt migration project. Having worked across diverse projects involving Python (Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn), TensorFlow/PyTorch and a wide range of ETL tools including Apache Airflow, Talend, and Azure Data Factory, I am well positioned to pick or fine-tune the best qwen2.5 variant on a Tesla P100 16 GB box. To further optimize GPU memory usage and maintain quality at minimal latency, I'll apply the available quantisation techniques, context-window tweaking and intelligent batching strategies. Importantly, I have hands-on experience with predictive modeling optimization exercises similar to yours. I helped a major e-commerce client improve their inventory forecasting by implementing an on-prem AI solution that reduced hosting costs by 45%. I’m ready to bring this experience to your project. As for my hourly rate, let's discuss it.
£13 GBP in 40 days
2.6

Hello, The real challenge here is optimizing the qwen2.5 model to fit the constraints of the Tesla P100 while ensuring performance remains high. The first step will be to select the best variant and fine-tune it for the specific prompts. Quantization and batching will be crucial to manage latency effectively. Shadow testing will help compare the new model's performance against the OpenAI baseline, focusing on match-score deltas. What are the specific metrics you want to track during shadow testing? Are there any existing tools you prefer for performance analysis? Ready to start and ensure a smooth transition to the new model.
£13 GBP in 40 days
2.7

As an experienced Full Stack Developer with a strong background in PHP, Laravel, and React JS, I believe I am the perfect candidate for your self-hosted LLM prompt migration project. My six years of versatile experience covers a range of areas including AI Chatbot Development, AI Development, and Machine Learning, which are directly in line with your project's needs. Regarding similar projects I've delivered in the past, one that comes to mind is a migration project involving an on-prem deployment of a large language model very similar to the task at hand. In this case, we successfully moved the workload onto a single GPU while maintaining the expected level of performance. Given access to a Tesla P100 16 GB box, and with maximum GPU memory optimization in mind, I would first fine-tune a qwen2.5 variant for limited resources using quantisation techniques, and set context window and batch sizes precisely to keep latency controlled while maintaining the desired output quality. My penchant for performance optimization and my commitment to delivering results that matter to your business make me well positioned to deliver this project. Let's make it happen!
£10 GBP in 2 days
2.2

Hello, I can help you migrate your AI prompts from OpenAI to your local P100 box.

Approach:
• Pick the best qwen2.5 variant for your P100, set quantisation and context window to keep latency low
• Run shadow tests on each prompt, analyse CV match-score deltas, recommend switch timing
• Repeat for all four prompts over 6–10 weeks

Technologies:
• Ollama, qwen2.5 7B/14B, Tesla P100 16 GB, Python, model quantisation and tuning

Extras:
• Full validation that metrics match OpenAI baseline before flipping
• Brief documentation on model choices and tuning decisions

Timeline:
• 6–10 weeks at 15–25 hours per week

My hourly rate is £15 GBP.

One similar project: I migrated a production NLP pipeline from a cloud API to a local RTX 3090, tuning a Llama model for resume parsing while keeping accuracy within 1% of the original.

First change: I’d test qwen2.5 7B at Q4_K_M quantisation with a 4096 context window, then run shadow mode on the first prompt to compare match-score distributions.

Ready to get started. Agustin
£15 GBP in 40 days
2.0
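The first change in the bid above (qwen2.5 7B at Q4_K_M with a 4096 context window) maps onto a single request against Ollama's local REST API, sketched below. The model tag `qwen2.5:7b-instruct-q4_K_M` is an assumption and should be checked against `ollama list` on the box; the zero temperature and fixed seed are also assumptions, chosen here only to make shadow-mode runs more repeatable.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_payload(prompt: str,
                  model: str = "qwen2.5:7b-instruct-q4_K_M",  # assumed tag, verify locally
                  num_ctx: int = 4096) -> dict:
    """Request body for a single non-streaming generation."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # return one JSON object instead of a token stream
        "options": {"num_ctx": num_ctx, "temperature": 0, "seed": 0},
    }


def generate(prompt: str, **kwargs) -> dict:
    """POST the payload to the local Ollama server and return the parsed reply."""
    body = json.dumps(build_payload(prompt, **kwargs)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=300) as response:
        return json.loads(response.read())


if __name__ == "__main__":
    reply = generate("Score this CV against the job description: ...")
    print(reply["response"])
    # eval_duration is reported in nanoseconds.
    print("tokens/s:", reply["eval_count"] / reply["eval_duration"] * 1e9)
```

Logging `eval_count` and `eval_duration` alongside each shadow response gives the latency side of the comparison without extra instrumentation.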

Hello, I’m very interested in helping you migrate your AI inference from OpenAI to the Tesla P100 box running qwen2.5. I understand the key challenge is maintaining match score accuracy while optimizing model selection, tuning, and latency within limited GPU memory. My hourly rate is 12 GBP. I recently completed a project migrating an on-prem LLM to NVIDIA hardware where I fine-tuned models with quantization and optimized batching to balance latency and quality. Given the P100 and qwen2.5, my first step would be to evaluate quantization levels and context window size to maximize throughput without accuracy loss. Could you share more about your current shadow testing setup and metrics tracking? Best regards, AbdulHamid
£12 GBP in 40 days
1.6

Hi there!

The goal of the project: optimize and migrate prompt inference from OpenAI to a self-hosted Qwen model on the P100 while maintaining accuracy and performance.

I have carefully reviewed your requirements and understand you are shifting production prompts to a local LLM setup with strict accuracy parity validation and a focus on cost optimization. I am the best fit because I specialize in LLM optimization, on-prem inference tuning, and performance benchmarking for production AI systems.

1. Model selection and tuning of Qwen2.5 variants for P100 VRAM constraints and latency control
2. Quantization, context window optimization, batching and inference performance tuning
3. Shadow testing with metric comparison, accuracy validation and safe traffic migration strategy

I provide model optimization, testing, benchmarking, deployment tuning, and full documentation delivery. I have 9+ years of experience as a full stack developer and have worked on LLM deployment and inference optimization systems. Looking forward to chatting with you to make a deal.

Best Regards, Elisha Mariam
£13 GBP in 40 days
1.4

Given the description of your project, I believe I am the perfect fit to help you achieve your goals. My experience as a full stack developer with a focus on AI chatbot development will contribute significantly to this migration process. Having worked with technologies such as JavaScript, TypeScript, Python, and AI APIs including the OpenAI API in my past projects, I am quite comfortable with the system architecture you've described. Regarding a similar project, I have successfully migrated an LLM system from an external service to an on-premise environment while optimizing for memory utilization and maintaining performance. This project was essentially about selecting the most appropriate model that could fit and perform effectively within limited resources (similar to what you’re about to do). I used my understanding of different models and memory management techniques to extract the maximum performance out of a single P100. Given the P100 and qwen2.5 for your project, I'd first carry out detailed tests and analysis to establish the GPU memory usage thresholds per prompt in comparison with current baselines. Then, by planning quantization, context window, and batching carefully while keeping latency in check and isolating issues during shadow mode, I would ensure there's zero fall-off in CV match scores before any traffic is switched. Working along these lines over 6-10 weeks at 15-25 hours/week is well within my capability.
£25 GBP in 22 days
0.0

I can help migrate your AI inference from OpenAI to your Tesla P100 setup running qwen2.5 7B. I’ve managed similar self-hosted LLM transitions, ensuring smooth integration and optimized performance on specialized hardware. My approach will focus on adapting your prompts and workflows to the new environment while maintaining inference speed and accuracy. Do you have existing prompt templates, or will they need to be created from scratch?
£12.50 GBP in 7 days
0.0

Hello, When it comes to implementing large language models, experience and precision are non-negotiable. Having single-handedly migrated several projects to on-premise LLMs, I know exactly how critical every detail is to a smooth transition. The Tesla P100 and qwen2.5 certainly ring a bell; in my recent project, I led a team that not only deployed a similar architecture (switching between local and remote per-prompt) but also refined model selection, tuning, and validation to get the kind of results you seek. What distinguishes me from other bidders is my ability to provide backend solutions that handle real-world pressure with almost zero throughput loss. Cost reduction is pivotal for you in this migration; my focus on optimizing GPU memory while keeping quality intact is well proven and aligned with your requirements. To make the most of the given hardware and qwen2.5 variant, my first move would be to set up quantisation and then define the context window size correctly. Furthermore, my extensive experience with other relevant technologies (like infrastructure setup) will contribute greatly to successful delivery of this project within the suggested time frame while making the most of the hours worked. Thanks!
£37 GBP in 31 days
0.0

Hello, My name is Mathis, and as a seasoned AI and Full-Stack Developer, I am well-equipped to help you achieve your goal of migrating your AI inference off OpenAI. With over 7 years of experience in the field, I have completed numerous projects that demanded a delicate balance between resource optimization and high-performance outputs, skills that would be valuable for your project. As a result, I have developed expertise in squeezing the most out of limited GPU memory while maintaining quality, which is precisely what your project needs. Not only do I write code, but I also understand the importance of aligning technology with business goals, ensuring efficient solutions tailored to your specific needs. As for directly relevant experience, I have completed similar projects involving on-prem LLMs and complex migrations. One notable example was a project where I aligned a client's large language model (LLM) with hardware similar to what you are deploying, a Tesla P100. This approach allowed me not only to adapt the model to the new infrastructure but also to ensure reliable performance across intensive usage cycles, an invaluable experience for this prompt migration project. Given a Tesla P100 and qwen2.5, my first change would be streamlining the quantisation process and adjusting the context window and batching to maintain low latency without compromising on throughput and perform Thanks!
£37 GBP in 22 days
0.0

City of London, United Kingdom
Member since Jun 2, 2026