
Closed
Pays on delivery
Turkish Job Listing (translated)
Title: Data Engineer for Cleaning a 500 GB Turkish Text Corpus
About the project: We have a 500 GB raw text archive for pretraining a Turkish-focused large language model. Part of the basic cleaning (stripping links, symbols, code, etc.) has been done, but we need an experienced data engineer to fully standardize the data, filter sensitive content, perform language detection and deduplication, and then set up a sustainable workflow.
Tasks: Write automation scripts that ingest raw data (.txt / jsonl etc. from various sources), normalize it, and fix UTF-8 / Unicode issues. Build a rule-based (and, where needed, ML-based) pipeline for language detection, content filtering, personal-data masking, deduplication, and layout cleanup. Define quality metrics (token counts, Turkish-language ratio, sampled QC reports) and report on them regularly. Produce the final dataset (jsonl / parquet) and documentation for delivery to the NLP training teams.
Who we are looking for: At least 3 years of experience in large-scale text cleaning / data wrangling. Strong command of the Python ecosystem: pandas, multiprocessing, regex, BeautifulSoup, ftfy, spaCy/FastText-based language detection, etc. Experience with Spark/Dask or similar distributed systems preferred. Familiarity with Turkish language structure is a further advantage. Candidates who can produce reports from data (e.g. Jupyter, Metabase) and automate the pipeline (Airflow, Prefect) are preferred.
Duration and working arrangement: Expected duration: up to 3 months (full-time or intensive part-time). Remote work is possible; we expect regular checkpoints/demos.
Compensation: 60,000–80,000 TL net per month depending on experience (or the equivalent in foreign currency). The budget can be stretched for more experienced candidates.
In your application, please: Give examples from similar projects (log cleaning, dataset preparation). Summarize the main tools/pipeline outline you plan to use. State when you can start.
English Job Listing
Title: Data Engineer (500 GB Turkish Text Corpus Cleaning for LLM Pretraining)
About the project: We are preparing a Turkish-focused large language model and currently hold ~500 GB of raw text. Initial cleaning steps (removing URLs, code, non-text) are done, but we need an experienced Data Engineer to standardize the corpus, perform language detection, deduplication, sensitive-content filtering, and build a reproducible cleaning pipeline ready for pretraining.
Responsibilities: Design and implement automated scripts/pipelines to ingest, normalize, and validate mixed-format text data (txt/jsonl). Apply language detection, PII masking, deduplication, whitespace/layout cleanup, and quality filtering (rule-based + optional ML tools). Define quality metrics (token counts, Turkish-language ratio, sampled QC reports) and deliver periodic summaries. Produce the final cleaned dataset (jsonl/parquet) plus documentation for the training team.
Requirements: 3+ years of experience in large-scale text/data wrangling projects. Strong Python skills (pandas, multiprocessing, regex, BeautifulSoup, ftfy, spaCy/FastText for language ID). Bonus: experience with Spark/Dask or other distributed frameworks. Familiarity with Turkish language structure is a plus. Comfortable building reporting dashboards/notebooks and automating pipelines (Airflow/Prefect, etc.).
Logistics & Compensation: Expected timeline: up to 3 months (full-time or concentrated part-time). Fully remote collaboration is fine; regular demos/checkpoints required. Compensation: ₺60,000–80,000 net per month (or equivalent in foreign currency) depending on experience; higher budgets negotiable for senior profiles.
How to apply: Share examples of similar projects (dataset cleaning, NLP corpora preparation). Describe the tools/pipeline you plan to use. Mention earliest start date and availability.
Looking forward to meeting data engineers who are passionate about building clean, large-scale Turkish corpora!
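The ingest-and-normalize step the listing describes (fix Unicode, clean whitespace, emit jsonl) can be sketched in a few lines. This is a minimal stdlib-only illustration, not the project's actual pipeline; a real setup would likely add ftfy for mojibake repair, and the names `clean_line` and `to_jsonl` are hypothetical:

```python
import json
import re
import unicodedata

def clean_line(text: str) -> str:
    """Normalize one raw line: Unicode NFC, drop control chars, collapse whitespace."""
    text = unicodedata.normalize("NFC", text)  # unify composed/decomposed forms (ç vs c+cedilla)
    # Drop control characters (category C*), keeping tab/space for the collapse step.
    text = "".join(ch for ch in text if unicodedata.category(ch)[0] != "C" or ch in "\t ")
    text = re.sub(r"\s+", " ", text).strip()   # collapse runs of (Unicode) whitespace
    return text

def to_jsonl(lines, source: str):
    """Yield cleaned records in the jsonl shape the listing asks for."""
    for raw in lines:
        cleaned = clean_line(raw)
        if cleaned:  # skip lines that are empty after cleaning
            yield json.dumps({"text": cleaned, "source": source}, ensure_ascii=False)
```

With `ensure_ascii=False`, Turkish letters survive verbatim in the jsonl output instead of being escaped, which keeps the corpus human-inspectable during QC.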
Project ID: 39875626
39 bids
Remote project
Last activity: 4 months ago
39 freelancers are bidding an average of $667 USD for this project

Hi there, I am excited to apply for the Data Engineer position focused on cleaning a 500 GB Turkish text corpus. With over 4 years of experience in large-scale text processing, I have successfully led projects that involved data wrangling and cleaning, including a recent initiative where I implemented a Python-based pipeline that reduced data processing time by 40%. To meet your requirements, I will design an automated pipeline using Python libraries such as pandas and spaCy for language detection, along with regex for content filtering. I will also ensure that quality metrics are established, allowing for regular reporting and documentation of the cleaned dataset in jsonl format. I would be happy to discuss your needs and get started right away. Best regards, Marko
$500 USD in 7 days
3.5

Hello there, I have reviewed your requirements and I'm confident that my experience and skills align with what you're looking for. As an accomplished AI expert with strong proficiency in Python programming and data processing, I can complete your project within the given timeline. My 10+ years of professional experience have given me in-depth knowledge across several fields of artificial intelligence, especially computer vision, machine learning, deep learning, and image processing, all of which could benefit your project. Client interaction is another area I focus on strongly; together we can optimize not just the Python implementation but the project's overall progression. Let's discuss this further and start a project that exceeds your expectations! Best regards.
$250 USD in 2 days
3.3

I am excited to tackle the challenge of cleaning and standardizing a massive 500 GB Turkish text corpus for LLM pretraining. With over 3 years of experience in data wrangling, strong Python skills including pandas and spaCy, and familiarity with distributed systems like Spark, I am well-equipped to design automated pipelines for language detection, deduplication, and content filtering. Let's collaboratively ensure this dataset is impeccable and ready for training. Would love to discuss further details and get started on this exciting project! Regards, Araminta (Minted Solutions)
$400 USD in 5 days
2.8

⭐⭐⭐⭐⭐ Dear Client ⭐⭐⭐⭐⭐ I would be delighted to work with you on cleaning your Turkish text data. With more than 9 years of experience in data preprocessing, NLP, and text cleaning (especially for Turkish text), I specialize in turning messy text into clean, usable data for analysis or modeling. What I will deliver: a comprehensive Python cleaning pipeline covering tasks such as removing unwanted characters, normalizing Turkish letters (ç, ğ, ş, etc.), lowercasing, removing extra whitespace, removing or correcting punctuation, and removing stop words. Clear documentation: the code will be modular, with adjustable functions (e.g. disabling specific cleaning steps), sample scripts, and test cases so you can verify the cleaned output against the raw input. Why choose me: deep hands-on experience in Turkish NLP projects (e.g. text-cleaning libraries, academic and applied work); solid knowledge of Python, pandas, and regular expressions; and use of existing open-source tools (e.g. Turkish preprocessing microservices or text-cleaning utilities) to avoid reinventing the wheel. I am ready to start soon. Let's make your Turkish text data clean, consistent, and ready for the next data-science step. Best regards,
$7,000 USD in 27 days
2.9

Hi there, hope you’re doing well. This project sounds right up my alley. I have solid experience in large-scale text data cleaning and NLP pipeline development, and I can help you turn your 500 GB Turkish text corpus into a clean, structured dataset ready for model training. My approach would include normalizing and validating all text files, detecting and filtering out non-Turkish content, removing duplicates, masking sensitive information, and generating quality reports with clear metrics. I usually build these pipelines in Python using tools like pandas, spaCy, FastText, and Dask to ensure efficiency and scalability. The final result will be a fully reproducible workflow that can be rerun or extended easily, along with documentation and the final dataset in your preferred format. I’ve worked on similar corpus preparation projects for multilingual LLM training, so I understand the importance of data consistency and high-quality filtering. I’m ready to dive in and start as soon as you’d like. Looking forward to collaborating with you on this. Best regards, Roman
$500 USD in 7 days
2.3

Hi, I’m Ihor, a full stack developer with 12 years of experience. I’d love to work on your project and I’m confident I can get it done the right way. I’ve handled similar projects before, so I understand what’s needed here. My plan is simple: clear communication, reliable work, and delivering results on time—usually earlier than the deadline. I can start right away and will keep you updated throughout the process so everything runs smoothly. Let me know if you’d like to go over any details. I’m ready when you are. Best, Ihor
$500 USD in 4 days
1.6

I am a perfect fit for your project, focusing on cleaning and preparing a 500GB Turkish text corpus for language model pretraining. With expertise in Python (pandas, regex, spaCy), data wrangling, and ML pipelines, I ensure efficient normalization and deduplication processes. While new to Freelancer, I bring extensive off-site experience to the table. I would love to chat more about your project! Regards, Christian Wilkat
$400 USD in 3 days
1.0

⭐⭐⭐⭐⭐ Dear client, I am the best candidate for your 500 GB Turkish text corpus cleaning project, bringing extensive experience in large-scale data wrangling, NLP preprocessing, and pipeline automation for LLM pretraining tasks. I specialize in Python (pandas, regex, multiprocessing, BeautifulSoup, ftfy, spaCy, FastText) and have worked on multiple multilingual corpus-standardization workflows including deduplication, sensitive-content filtering, PII masking, and language identification. I can design a robust, reproducible data-cleaning pipeline integrated with Airflow or Prefect, produce quality metrics (token counts, Turkish ratio, QC samples), and deliver a clean dataset in jsonl/parquet formats ready for model training. With proficiency in Spark/Dask distributed processing and deep familiarity with Turkish linguistic structures, I guarantee efficiency, accuracy, and transparency through regular reports and checkpoints. Looking forward to your response. Best regards.
$250 USD in 7 days
0.6

If you don't like it, you don't pay; the worst that can happen is you walk away with a free consultation. I am a perfect fit for your project: cleaning and preparing the 500 GB Turkish text corpus for language model pre-training. With expertise in Python (pandas, regex, spaCy) and data wrangling, I ensure seamless data normalization and automated pipeline creation. While new to Freelancer, I bring advanced experience in text cleaning and dataset preparation. I would love to chat more about your project! Regards, L.J. MOMSEN
$400 USD in 14 days
0.0

I'm De Wet; let's transform your vision into reality. Your project needs a data engineer experienced in cleaning and standardizing a 500 GB Turkish text corpus. I specialize in automating data wrangling, language detection, and building efficient pipelines. While new to Freelancer, I have extensive off-site experience on similar projects. I would love to chat more about your project! Regards, De Wet
$250 USD in 3 days
0.0

YOUR SEARCH ENDS HERE. I specialize in crafting clean, modern, and high-performing digital solutions that truly make an impact. With experience delivering projects for clients beyond Freelancer, I am excited to take on the challenge of cleaning and standardizing the 500 GB Turkish text corpus for your LLM pretraining. My expertise in data wrangling, Python, and building automated pipelines aligns perfectly with your project requirements. I am eager to bring your vision to life, ensuring a seamless and efficient process from start to finish. Let's collaborate to create a top-notch dataset that sets the stage for your language model's success. Warm regards, Thomas
$400 USD in 14 days
0.0

Subject: Perfect Fit for Your Project: Data Engineer Specializing in Cloud Services and Web Development Dear KairaAC, I am the ideal candidate for your project as I excel in creating clean, professional, user-friendly, seamless, integrated, and automated solutions - precisely what you are seeking. With my expertise in Cloud services and Web Development, I guarantee technically solid projects that cater to end users' needs flawlessly. Though new to freelancer.com, I have a proven track record of delivering successful projects off-site. I am enthusiastic about discussing your project in greater detail to understand your requirements fully. My experience and skills align perfectly with the scope of work you have outlined. Looking forward to the opportunity to collaborate with you on this exciting project! Warm regards, Berto Agenbag
$400 USD in 14 days
0.0

Hi there, This is Nazar. Your ambitious goal of producing a clean, standardized 500 GB Turkish text corpus for LLM pretraining immediately caught my attention. For a project of this scale, I’d approach it with a modular pipeline—leveraging pandas and multiprocessing for core preprocessing, ftfy for Unicode fixes, spaCy/FastText for Turkish language detection, and integrating deduplication plus PII masking as discrete, testable stages. Reporting would be woven in via notebook-based QC dashboards and token counting per batch. In a previous project, I engineered automated scripts to clean and deduplicate a 300 GB multilingual news dataset, preparing it for downstream NLP—so I’m comfortable scaling and setting up reproducible, documented workflows. I can outline the entire pipeline before kickoff and am available to start within a week. Happy to answer any questions or share project samples. Best regards, Nazar
$500 USD in 2 days
0.0
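The "Turkish-language ratio" QC metric that the listing and several bids mention can be approximated without a trained model for quick sanity checks. The stopword/diacritic heuristic below is purely an illustrative stand-in for a real language-ID model such as FastText's lid.176; the word list, the two-stopword threshold, and the function names are assumptions made for this sketch:

```python
# Crude corpus-level QC metric: share of sampled lines that look Turkish.
TURKISH_CHARS = set("çğıöşüÇĞİÖŞÜ")  # letters specific to the Turkish alphabet
TURKISH_STOPWORDS = {"ve", "bir", "bu", "için", "ile", "de", "da", "çok"}

def looks_turkish(line: str) -> bool:
    """Heuristic: Turkish-specific diacritics, or at least two common stopwords."""
    if any(ch in TURKISH_CHARS for ch in line):
        return True
    words = set(line.lower().split())
    return len(words & TURKISH_STOPWORDS) >= 2

def turkish_ratio(sample: list[str]) -> float:
    """Fraction of sampled lines classified as Turkish (0.0 for an empty sample)."""
    if not sample:
        return 0.0
    return sum(looks_turkish(line) for line in sample) / len(sample)
```

On a 500 GB corpus one would compute this over random samples per batch and track the ratio across pipeline stages, rather than scanning everything.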

This project fits right in with what I do best. I understand the need for a skilled Data Engineer to handle the cleaning and standardization of a massive 500 GB Turkish text corpus. My experience in data wrangling, Python, and building efficient pipelines aligns perfectly with your requirements. My expertise lies in delivering high-quality, modern, and efficient solutions tailored to each client’s needs. While I’m new to Freelancer, I bring a wealth of experience from numerous successful projects completed off-platform. I’d be happy to discuss your project in more detail. Regards, Gabriel
$400 USD in 14 days
0.0

If my work doesn’t meet your expectations, you don’t owe a cent. Worst case? You walk away with valuable insights at no cost. I understand the need for meticulous data cleaning evident in your 500 GB Turkish Text Corpus project. With over three years of experience in large-scale data wrangling, Python proficiency, and a knack for building automated pipelines using tools like pandas, spaCy, and Airflow, I am well-equipped to tackle the tasks at hand. While I am new to freelancer.com, I have tons of experience and have done other projects similar to this off site. Ready to start and deliver results that speak for themselves. Regards, Tristan Scheepers
$400 USD in 14 days
0.0

I would love to help you with this project. I’ve gone through your job description and I understand you’re looking for a data engineer experienced in cleaning a 500GB Turkish text corpus for language model pretraining. I can deliver precise data standardization, language detection, deduplication, and establish a robust cleaning pipeline aligned with your requirements. With my skills in large-scale text cleaning and data wrangling, Python proficiency (pandas, regex, spaCy), and experience in Spark/Dask, I focus on delivering high-quality work effectively. I have successfully completed similar projects off-site. I would love to chat more about your project! Regards, Enrique Strauss
$400 USD in 14 days
0.0

✔️ Data Engineer for Cleaning a 500 GB Turkish Text Corpus ✔️ Hello, I have reviewed your listing in detail. You are looking for an experienced data engineer who can clean, standardize, language-detect, and deduplicate a 500 GB raw Turkish text corpus before pretraining. Your project calls for a sustainable data-preparation process built on a rule-based and, where needed, ML-based pipeline. I recently completed a similar project: I refocused 300 GB of multilingual text data on Turkish, using FastText for language detection, MinHash-based approaches for deduplication, and automated QC reporting. I automated the pipeline with Python + Dask + Airflow and resolved encoding and HTML-cleanup issues with ftfy and BeautifulSoup. In my experience, the hardest part of such projects is balancing performance against accuracy in language detection and deduplication. On a dataset as large as 500 GB, memory optimization and parallel data flow become critical, and Turkish-specific characters, encoding errors, and near-duplicate text variants require tailored regex and normalization strategies. With my expertise in large-scale data cleaning and NLP pipelines, I can complete your project successfully. You can rely on me. I would like to discuss the project structure, sample data formats, and the infrastructure to be used in detail. With the right kickoff plan, we can deliver a production-grade, sustainable pipeline within 3 months.
$500 USD in 7 days
0.0
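The MinHash-based deduplication mentioned in the bid above estimates Jaccard similarity between documents from compact signatures instead of comparing full texts. A minimal pure-Python sketch follows; the signature size (64), shingle length (5), and seeded blake2b hashing are illustrative choices, and production systems typically add LSH banding on top to avoid pairwise comparisons:

```python
import hashlib

def shingles(text: str, k: int = 5) -> set:
    """Character k-gram shingles of a whitespace-normalized, lowercased text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def minhash(text: str, num_perm: int = 64) -> list:
    """MinHash signature: for each of num_perm seeded hash functions,
    keep the minimum hash value over the text's shingles."""
    grams = shingles(text)
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{g}".encode(), digest_size=8).digest(), "big"
            )
            for g in grams
        ))
    return sig

def est_similarity(a: list, b: list) -> float:
    """Estimated Jaccard similarity = fraction of matching signature slots."""
    return sum(x == y for x, y in zip(a, b)) / len(a)
```

Pairs whose estimated similarity exceeds a threshold (often around 0.8) would then be flagged as near-duplicates and collapsed to a single copy.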

We are the perfect fit for your project. With expertise in large-scale data wrangling, we understand the need for clean, professional Turkish text processing. Our skills in Python, data normalization, and pipeline automation align perfectly with your requirements. While new to Freelancer, we bring years of hands-on experience from off-site projects. We would love to chat more about your project! Regards, Stephan
$400 USD in 30 days
0.0

Hello! I understand how important data cleaning is for LLM pretraining. I previously supported a team building a language model by processing a similarly sized volume of Turkish text data. I will strip the dataset of character-encoding issues, HTML tags, and meaningless repetitions; perform tokenization, normalization, and stop-word cleanup using NLTK and spaCy; and use advanced hashing algorithms to detect duplicates. What approach do you have in mind for anonymizing any personal information in the dataset? In what format would you like the cleaned data delivered? I am available to discuss the details.
$597.45 USD in 21 days
0.0

I specialize in data engineering and offer a tailored solution for cleaning and standardizing your 500 GB Turkish text corpus for language model pre-training. Have you considered leveraging machine learning algorithms to enhance the efficiency of your data cleaning process? With over three years of experience in text data wrangling, proficiency in Python (pandas, regex, spaCy), and a track record of delivering high-quality results, I guarantee a seamless and professional approach. My track record comes from years of successful work outside Freelancer. To establish myself here, I’m offering discounted pricing in exchange for a good review, while still delivering the same high-quality results my clients trust me for. When can we discuss the specifics of your project requirements?
Milestones:
1. Data ingestion and normalization
2. Language detection and content filtering
3. Deduplication and quality metrics
4. Final dataset delivery and documentation
5. Review, feedback, and final adjustments
I’d love the chance to explore your project further. Even if we don’t move forward, you'll still walk away with valuable insights at no cost. Kind regards, Ruan
$400 USD in 7 days
0.0

Izmir, Turkey
Member since Oct 12, 2025