Looking for a Python developer / data engineer.
Should have experience ingesting and processing data as a stream.
Demonstrable experience handling 2-3 GB of source data.
Knowledge of object-oriented programming concepts, professional documentation methods and Python lambda functions is a must.
Oracle VM VirtualBox, Ubuntu Linux 14.04.5 LTS, PyCharm, Anaconda environment
Data is available as TSV extracts from multiple sources in CDL. The data engineer should be able to merge the TSV extracts by applying the correct join techniques. Because the data is delivered in compressed form, the engineer should read it as a stream rather than decompressing it in full, since the uncompressed data might not fit into memory. Memory-efficient code is therefore expected. The merged data will be transformed and stored in a PostgreSQL database.
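To illustrate the kind of streaming read described above, here is a minimal sketch. It assumes the extracts are gzip-compressed (the posting only says "compressed"), and the function name `stream_tsv_rows` is hypothetical.

```python
import csv
import gzip
from typing import Iterator, List


def stream_tsv_rows(path: str, encoding: str = "utf-8") -> Iterator[List[str]]:
    """Yield one row at a time from a gzip-compressed TSV file.

    Assumption: the extract is gzip-compressed; other formats (bz2, zip)
    would need a different opener.
    """
    # In text mode gzip.open decompresses incrementally, so memory use
    # does not grow with the uncompressed size of the file.
    with gzip.open(path, mode="rt", encoding=encoding, newline="") as handle:
        # QUOTE_NONE treats quote characters as ordinary data.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            yield row
```

Because the function is a generator, a caller can loop over it and process each row before the next one is decompressed.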
The function should follow the object-oriented paradigm, with continuous integration and deployment in mind. Version control is also expected.
- Each data snapshot can contain multiple headerless main data files in TSV format, each up to 2 GB in size. The engineer should be able to read the files as a stream while unpacking them, because they usually do not fit into RAM.
- In addition to the main data files, each snapshot has a file with the header names and multiple lookup files that map the numeric IDs from the main data to strings, comparable to a foreign key in an SQL database.
- Data should be read and transformed on a record-by-record basis (stream or mini-batch processing).
- Each combined and transformed record should be prepared for multiple data sinks, e.g. as SQL query strings that write a record into PostgreSQL or MS SQL. The engineer will write an adapter for each data sink, all sharing a common interface, so that the same function call can be used to write into any of the specified data sinks.
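The lookup resolution and the common adapter interface described in the points above could look roughly like the sketch below. All names (`load_lookup`, `transform`, `SinkAdapter`, `prepare`) are hypothetical, the lookup files are assumed to have two columns (ID, label), and the placeholder styles assume psycopg2 (`%s`) and pyodbc (`?`) as drivers, which the posting does not specify.

```python
import csv
from abc import ABC, abstractmethod
from typing import Dict, Iterable, Iterator, List, Tuple


def load_lookup(lines: Iterable[str]) -> Dict[str, str]:
    """Build an ID -> label mapping from two-column TSV lookup lines."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row[0]: row[1] for row in reader if len(row) >= 2}


def transform(rows: Iterable[List[str]], header: List[str],
              lookups: Dict[str, Dict[str, str]]) -> Iterator[Dict[str, str]]:
    """Attach header names to each row and resolve IDs via the lookups."""
    for row in rows:
        record = dict(zip(header, row))
        for column, mapping in lookups.items():
            if column in record:
                # Unknown IDs are left unchanged rather than dropped.
                record[column] = mapping.get(record[column], record[column])
        yield record


class SinkAdapter(ABC):
    """Common interface: every sink prepares a statement the same way."""

    @abstractmethod
    def prepare(self, table: str, record: Dict[str, str]) -> Tuple[str, List[str]]:
        """Return a parameterized INSERT statement and its values."""


class PostgresAdapter(SinkAdapter):
    def prepare(self, table, record):
        # Naive identifier quoting; real code should validate names.
        columns = ", ".join(f'"{c}"' for c in record)
        marks = ", ".join(["%s"] * len(record))
        return f'INSERT INTO "{table}" ({columns}) VALUES ({marks})', list(record.values())


class MsSqlAdapter(SinkAdapter):
    def prepare(self, table, record):
        columns = ", ".join(f"[{c}]" for c in record)
        marks = ", ".join(["?"] * len(record))
        return f"INSERT INTO [{table}] ({columns}) VALUES ({marks})", list(record.values())
```

The sketch returns parameterized statements instead of interpolating values into the SQL string, which avoids quoting problems; executing them against a live connection is left out.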
The code provided should be modular, reusable and well documented. The engineer needs to know how to build Python modules with classes, using OOP decomposition practices and inheritance (e.g. abstract classes).
- Code should have unit tests where appropriate.
- Code will be implemented as a Python AWS Lambda function. The engineer should be familiar with building Lambda functions and should ideally have a local development environment set up for building and uploading them.
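As a rough idea of how mini-batch processing might sit inside a Lambda entry point, consider the sketch below. The `(event, context)` handler signature is the standard one for Python Lambda functions; everything else is an assumption: the event keys `path` and `batch_size` are hypothetical, the file is assumed to be gzip-compressed and locally readable, and the transform and write steps are left as a comment.

```python
import gzip
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def mini_batches(items: Iterable[Any], size: int) -> Iterator[List[Any]]:
    """Group a stream into lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, int]:
    """Stream a compressed TSV file and process it in mini-batches."""
    path = event["path"]  # hypothetical event field
    batch_size = int(event.get("batch_size", 1000))  # hypothetical field
    records = 0
    batches = 0
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        # Generator expression: rows are split lazily, one line at a time.
        rows = (line.rstrip("\n").split("\t") for line in handle)
        for batch in mini_batches(rows, batch_size):
            # Transforming the batch and writing it to a sink would go here.
            records += len(batch)
            batches += 1
    return {"records": records, "batches": batches}
```

Only one batch is held in memory at a time, so the batch size can be tuned to the memory configured for the function.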