Looking for a Perl programmer to write a script that parses data from an 800 MB to 1.5 GB text file.
This script will need to:
* be readable and well commented. I can write some Perl, but I'm far from an expert, and I'd like to be able to extend and use this script far into the future. The script will be run nightly via a cron job.
* be extremely fast. We will test this on 2-3 GB files daily.
I have my own script, but it runs very slowly and inefficiently. I need someone to write something from scratch that is much faster. If my script runs faster than yours, I will send yours back for reworking. Obviously that would be a waste of time, so think carefully about the algorithm and your approach before writing anything.
* accept the data filename on the command line,
e.g. ./[url removed, login to view] [url removed, login to view]
* Each chunk of data in the text file is separated by a special group of characters on its own line; chunks are variable in length. We will forward you an example text file at project start. In the meantime, here is the generic layout: there are three data fields in each chunk. The first two fields are each on their own single line. The third field is multi-line and can range from 1 line to 40 lines or more. The special group of characters then separates one data chunk from the next.
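The chunk layout above maps naturally onto Perl's input record separator, which lets the script read one whole chunk per loop iteration without ever slurping a multi-gigabyte file into memory. A minimal sketch, assuming the separator is a line of five dashes ("-----") as a stand-in for the real group of characters:

```perl
use strict;
use warnings;

# The separator line "-----" is an assumption -- substitute the real
# group of characters once it is known.
my $SEP = "-----\n";

# Returns a list of [field1, field2, field3] triples from the data file.
sub read_chunks {
    my ($path) = @_;
    open my $fh, '<', $path or die "cannot open $path: $!\n";
    # Setting the input record separator makes Perl hand back one whole
    # chunk per read, so a 2-3 GB file is never held in memory at once.
    local $/ = $SEP;
    my @chunks;
    while (my $chunk = <$fh>) {
        chomp $chunk;                # strip the trailing separator line
        $chunk =~ s/\n+\z//;         # and any blank line before it
        next unless $chunk =~ /\S/;  # skip empty records
        # Fields 1 and 2 are single lines; field 3 is everything after.
        my ($f1, $f2, $f3) = split /\n/, $chunk, 3;
        push @chunks, [ $f1, $f2, $f3 ];
    }
    close $fh;
    return @chunks;
}

# Usage: my @chunks = read_chunks($ARGV[0]);
```

In the real script the body of the `while` loop would process each chunk as it is read rather than collecting them, so memory stays flat regardless of file size.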
* The script will match various keywords, regardless of case, in the third data field. I've done this by uppercasing the text and matching against uppercase keywords. I will provide these keywords to you. They are subject to later change and modification, so I need the ability to add/remove/edit them easily.
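One way to satisfy both requirements (case-insensitive matching, easy keyword editing) without uppercasing every field is to keep the keywords in one editable structure and precompile a single case-insensitive alternation per group. A sketch with placeholder group names, keywords, and output filenames:

```perl
use strict;
use warnings;

# Keyword groups in one easy-to-edit structure (all names and words here
# are placeholders -- the real lists will come from the client). Each
# group also names the output file its matches should be written to.
my %groups = (
    group1 => { file => 'group1_out.txt', keywords => [ 'foo', 'bar baz' ] },
    group2 => { file => 'group2_out.txt', keywords => [ 'qux' ] },
);

# Precompile one alternation per group with /i, so matching is
# case-insensitive without copying and uppercasing the field, and the
# field is scanned once per group instead of once per keyword.
my %group_re;
for my $g (keys %groups) {
    my $alt = join '|', map { quotemeta } @{ $groups{$g}{keywords} };
    $group_re{$g} = qr/$alt/i;
}

# Returns the names of the groups whose keywords appear in $text.
sub matching_groups {
    my ($text) = @_;
    return grep { $text =~ $group_re{$_} } sort keys %group_re;
}
```

Editing the keyword lists then means touching only the `%groups` table at the top of the script; the regexes are rebuilt automatically on the next run.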
* Each keyword belongs to a single keyword group, and certain things need to be done per group. A match on any keyword in a group means that part of the first field in the data chunk will be copied and written to a single text file whose name I can set. I imagine each keyword group will have 10-30 keywords/phrases to match.
* The script will then proceed to the next group, with its own unique and different keywords, and check the entire text file as well, writing the first field of each matching data chunk to another text file. I would like only part of the first data field to be written, although I'd like the option to include the other fields.
* The script will proceed to another group of keywords and copy the first field to yet another text file. Again, I would like only part of the first data field to be written, with the option to include the other fields.
* Lastly, the script will find the data chunks that matched no keyword in any of the keyword groups and write all three fields to another text file.
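Rather than re-reading the file once per group, all of the steps above can be folded into a single pass: each chunk is tested against every group as it is read, which is what keeps a 2-3 GB run fast. A sketch of the per-chunk dispatch, with placeholder group names and keywords, and with "part of the first field" assumed (for illustration only) to mean its first whitespace-separated token:

```perl
use strict;
use warnings;

# Placeholder keyword groups -- the real ones will come from the client.
my %keywords = (
    group1 => [ 'alpha', 'beta' ],
    group2 => [ 'gamma' ],
);

# Precompile a case-insensitive regex per keyword, keeping the original
# word alongside it so per-keyword counts can be reported later.
my %kw_re;
for my $g (keys %keywords) {
    $kw_re{$g} = [ map { [ $_, qr/\Q$_\E/i ] } @{ $keywords{$g} } ];
}

my (%group_count, %kw_count, %out, @no_match);

# Dispatch one chunk ($f1..$f3 are its three fields) in a single pass
# over all groups.
sub dispatch {
    my ($f1, $f2, $f3) = @_;
    my $matched = 0;
    for my $g (sort keys %kw_re) {
        my $hit = 0;
        for my $pair (@{ $kw_re{$g} }) {
            my ($word, $re) = @$pair;
            next unless $f3 =~ $re;
            $kw_count{$g}{$word}++;   # per-keyword tally for the report
            $hit = 1;
        }
        if ($hit) {
            $group_count{$g}++;
            # Only part of field 1 is kept (here: its first token --
            # adjust to the real rule, or append $f2/$f3 if the option
            # to include the other fields is switched on).
            push @{ $out{$g} }, (split ' ', $f1)[0];
            $matched = 1;
        }
    }
    # Chunks matching no group at all keep all three fields.
    push @no_match, [ $f1, $f2, $f3 ] unless $matched;
}
```

In the real script `%out` and `@no_match` would be written straight to the per-group output files instead of accumulated in memory, so the single pass stays flat on memory as well as fast.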
* After the script has finished running, I would like a report sent to me via sendmail giving the total number of data chunks matched per group and per keyword. I do not want to install additional Perl modules, so please use as few modules as possible. In addition, I'd like a copy of the report kept on the server, in case the email doesn't get sent or doesn't arrive; it may be overwritten every time the script is run. Sample report:
Group 1: 100
Group 2: 1232
Match 1: 23
Match 2: 343
Match 3: 123
Match 4: 23
Group 3: 13343
No Match Group: 1232
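The no-modules constraint can be met by piping the report text straight to the local sendmail binary. A minimal sketch, in which the address, report path, sendmail path, and counts are all placeholders and the per-keyword "Match N" lines are omitted for brevity:

```perl
use strict;
use warnings;

# Placeholders -- substitute the real address and paths.
my $to          = 'you@example.com';
my $report_path = 'keyword_report.txt';

# In the real script these tallies come from the matching pass.
my %group_count = ( 'Group 1' => 100, 'Group 2' => 1232 );
my %no_match    = ( 'No Match Group' => 1232 );

# Builds the plain-text summary in the sample format shown above.
sub build_report {
    my ($counts, $nomatch) = @_;
    my $body = '';
    $body .= "$_: $counts->{$_}\n"  for sort keys %$counts;
    $body .= "$_: $nomatch->{$_}\n" for sort keys %$nomatch;
    return $body;
}

my $report = build_report(\%group_count, \%no_match);

# Keep a copy on disk in case the email is lost; '>' overwrites it on
# every run, as requested.
open my $fh, '>', $report_path or die "cannot write $report_path: $!\n";
print $fh $report;
close $fh;

# Pipe the report to sendmail (path assumed); ignore SIGPIPE so a
# missing binary does not kill the whole cron run.
local $SIG{PIPE} = 'IGNORE';
if (open my $mail, '|-', '/usr/sbin/sendmail', '-t') {
    print $mail "To: $to\n", "Subject: nightly keyword report\n\n", $report;
    close $mail;
}
```

Because both the disk copy and the email carry the same `$report` string, the on-server file is an exact fallback for the mailed version.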
Before payment is sent, I will test this on real data for verification and speed. Testing will take a day or two, depending on how fast your script runs.
In your bid, please state your experience with Perl, your programming background, and your ability to write clean, fast-executing code.
Payment will be via PayPal or GAF.
This project is not difficult for a Perl guru and shouldn't take too much time.
Feel free to ask questions.