I have a large amount of intraday stock data for over 600 stocks, stored in ASCII format, that I need to analyze to produce summary statistics. The analysis should output the summary statistics to an Excel workbook.
The primary objective of this exercise is to screen a large amount of financial information for missing entries. This is not a difficult project by any means, but the skill involved is in getting the program to run within a reasonable time period. Assume each of the .txt files is around 30 MB.
Currently, the data is stored in around 600 discrete .txt files. I have included one example file, which is around half the size of the usual files. Freelancer won't allow me to upload more due to size restrictions, so please PM me for further examples. The data is stored in the usual format: mm/dd/yyyy hh:mm open,high,low,close,volume.
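To give bidders a concrete picture, here is a rough Python sketch of parsing one record line. This is purely illustrative and not a requirement of any particular language; it assumes the date and time are space-separated from the comma-separated price/volume fields, as the format above suggests, and the function name is my own.

```python
from datetime import datetime

def parse_record(line):
    """Split one 'mm/dd/yyyy hh:mm open,high,low,close,volume' line.

    Assumes the date and time are separated by spaces from the
    comma-separated price/volume fields, as described above.
    """
    date_str, time_str, rest = line.strip().split(" ", 2)
    o, h, l, c, v = rest.split(",")
    timestamp = datetime.strptime(date_str + " " + time_str, "%m/%d/%Y %H:%M")
    return timestamp, float(o), float(h), float(l), float(c), int(v)
```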
As you will see, these files are large. I need someone who is an expert in very high-speed text parsing and data analysis. I'm not at all concerned about which programming language the developer chooses, but I would imagine one of the Microsoft programming languages would be easier given that the summary statistics output must be in Excel 2007.
I would like the application to perform the analysis and develop summary statistics for each stock. The summary statistics should include the following (a rough sketch of how these might be computed appears after the list):
Date of first data point
Date of last data point
Number of data points per day (a time series Excel graph of this count for each stock)
Days on which there are no data points (weekends should not be included in either of these)
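Purely as an illustration of what I mean by these statistics, not a prescription of how to implement them, a rough Python sketch might gather them per file like this. It assumes a line parser such as the `parse_record` helper sketched above, and treats Monday to Friday as the expected trading days; the non-trading days listed on the second worksheet would be excluded as well.

```python
from collections import Counter
from datetime import timedelta

def summarise_file(path, parse_record):
    """Count data points per calendar day and find weekday gaps."""
    points_per_day = Counter()
    with open(path, "r") as fh:
        for line in fh:
            if line.strip():
                timestamp, *_ = parse_record(line)
                points_per_day[timestamp.date()] += 1

    first_day, last_day = min(points_per_day), max(points_per_day)

    # Weekdays between the first and last data point with no entries at all.
    missing_days = []
    day = first_day
    while day <= last_day:
        if day.weekday() < 5 and day not in points_per_day:
            missing_days.append(day)
        day += timedelta(days=1)

    return first_day, last_day, points_per_day, missing_days
```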
The output workbook will be composed of three main areas:
1) A summary sheet that lists all the stocks in the folder to be analysed, showing each stock's start date, end date, number of missing days, and the days on which entries are missing.
2) A second worksheet that contains a list of days that are not to be included in the analysis (i.e. non-trading days other than weekends).
3) The full analysis sheet for each stock file, including a time series graph of the number of data points for each day and a list of the days for which there are no entries.
That means the final output will be a single Excel workbook containing a large number of worksheets: one for each stock, plus the two additional sheets described in points 1 and 2 above.
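Again purely as an illustration, and assuming a .xlsx output (which Excel 2007 can open) written with the openpyxl library, one per-stock worksheet with its time series chart might be produced roughly as follows. The sheet layout, cell positions, and function name are placeholders, not requirements.

```python
from openpyxl import Workbook
from openpyxl.chart import LineChart, Reference

def write_stock_sheet(wb, symbol, points_per_day, missing_days):
    """Add one per-stock worksheet: daily counts, a line chart, missing days."""
    ws = wb.create_sheet(title=symbol[:31])   # Excel sheet names are capped at 31 chars
    ws.append(["Date", "Data points"])
    for day in sorted(points_per_day):
        ws.append([day, points_per_day[day]])

    chart = LineChart()
    chart.title = f"{symbol}: data points per day"
    data = Reference(ws, min_col=2, min_row=1, max_row=ws.max_row)
    cats = Reference(ws, min_col=1, min_row=2, max_row=ws.max_row)
    chart.add_data(data, titles_from_data=True)
    chart.set_categories(cats)
    ws.add_chart(chart, "E2")

    ws.append([])
    ws.append(["Missing days"])
    for day in missing_days:
        ws.append([day])

# Usage sketch:
# wb = Workbook(); wb.remove(wb.active)
# write_stock_sheet(wb, "MSFT", counts, missing); wb.save("summary.xlsx")
```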
The application must also ask me for the period over which I wish to perform the analysis.
I will be using the application to analyze all of the files sequentially, so here's the challenge: the application MUST run through all 600 of the text files in less than 18 hours on a Dual Core Pentium T2050 laptop with 2 GB of RAM running Windows XP.
To ensure compliance, I will provide the winning bidder(s) with 50 text files, and the run-through must be completed in less than one and a half hours on a system of similar specification to mine. The winning bidders must ensure that the final application can do this before submitting the work for payment.
In responses, I would very much appreciate an outline of how you intend to proceed; some form of evidence of experience with large-scale, high-speed text parsing would be a distinct advantage. This project should be easy money for the right person. I would not expect the application to take more than two days to develop following bidder selection.
Please note that the files are much too large to import into Excel (any version); the summary statistics will therefore need to be calculated outside of Excel.
An enormous thanks to everyone for submitting their bids.
There is quite a lot of work for me to get through in answering all of the various questions that have arisen.
I intend to do all this on Sunday and make my final selection then.
One other thing I forgot to mention is that the program must be able to remove duplicate entries. As an example, in the newly added attachment you will see that from 02/19/2010 onwards there are duplicate lines for each time frame.
The program must output a new 'cleansed' file, without these duplicate entries, to a folder designated by the user.
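To illustrate what I mean by 'cleansed', here is a minimal Python sketch. It assumes duplicates are byte-for-byte repeated lines and that the original line order must be preserved; bidders should confirm both against the sample file.

```python
import os

def write_cleansed_copy(src_path, out_dir):
    """Copy src_path into out_dir, dropping any line already seen in the file."""
    seen = set()
    out_path = os.path.join(out_dir, os.path.basename(src_path))
    with open(src_path, "r") as src, open(out_path, "w") as dst:
        for line in src:
            key = line.strip()
            if key and key not in seen:
                seen.add(key)
                dst.write(line)
    return out_path
```

If the duplicated blocks always appear consecutively, comparing each line against only the previous one would avoid holding millions of lines in memory on the 2 GB machine.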
Thank you all very much for your bids.
However, I cannot possibly reiterate the following enough:
The data checking cannot be done in Excel (any version). The sample files I have provided are representative of the actual files.
Please assume that the actual files have a minimum of three million entries each, and that there are 600 such files which must be processed as a batch, with no user intervention from the moment the 'GO' button is pressed.
As a result, I have extended the bidding period by three weeks.