# Statistical Research Work

I have highlighted the pre-conditions in my "Identifying Foreign Key Relationships using [url removed, login to view]".

The solution to each of these questions should be generic to any data and should consist of statistical formulas/equations.

The data I have provided is for reference purposes only. I have to apply the statistical formulas/equations you provide to millions of records belonging to different databases.

I know we can achieve most of this using SAP or SPSS tools if we are studying specific data. But my research requires me to come up with statistical and mathematical formulas/equations that do this job automatically for any kind of data provided.

The solution should be something like the attachment, which is generic for any input data.

Job Description:

=================================================================
1. How to find similarities between strings having similar hash values (HashValue1 HashValue2 HashValue3 HashValue4 HashValue5 HashValue6 HashValue7 HashValue8 HashValue9).

The hash values associated with each word can appear in any order.

For example:
Two(-1787886117) Bike(-1099549855) Shops(1918157825)
Every(733576683) Bike(-1099549855) Shop(-1201797668)
Bike(-1099549855) Shops(1918157825)
Every(733576683) Shop(-1201797668)
Every(733576683) Shops(1918157825)

All of these need to be ranked based on the similarity of their hash values.
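
To make question 1 concrete, here is a minimal sketch of one possible order-independent measure: the Jaccard index over each string's set of word hashes, J(X, Y) = |X ∩ Y| / |X ∪ Y|. The choice of Jaccard, and the names `jaccard` and `rank_pairs`, are assumptions for illustration only; the hash values are the ones from the example above.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard index of two collections of word hashes (order is ignored)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty strings count as identical
    return len(a & b) / len(a | b)

def rank_pairs(items):
    """Score every pair of strings and sort from most to least similar."""
    scored = [(jaccard(items[x], items[y]), x, y)
              for x, y in combinations(items, 2)]
    return sorted(scored, key=lambda t: t[0], reverse=True)

# Word hashes copied from the example in the posting.
strings = {
    "Two Bike Shops": [-1787886117, -1099549855, 1918157825],
    "Every Bike Shop": [733576683, -1099549855, -1201797668],
    "Bike Shops": [-1099549855, 1918157825],
    "Every Shop": [733576683, -1201797668],
    "Every Shops": [733576683, 1918157825],
}

ranked = rank_pairs(strings)
for score, x, y in ranked:
    print(f"{score:.3f}  {x!r} vs {y!r}")
```

Comparing every pair is quadratic in the number of strings, so for millions of records some blocking or sketching step (for instance MinHash, which approximates the same index) would probably be needed; that part is not shown here.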

2. How to find similarities between groups of strings based on string hash values, for example between Group A and Group B. The groups could be 60%, 70%, etc. similar.
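
One way a percentage such as this could be produced is a weighted Jaccard index over per-group hash frequencies: for each word hash, take the smaller of the two group counts, sum these, and divide by the sum of the larger counts. The sketch below assumes that measure; `group_profile` and `group_similarity` are hypothetical names, and other measures (cosine similarity, for example) would give different percentages.

```python
from collections import Counter

def group_profile(group):
    """Count how many strings in the group contain each word hash."""
    profile = Counter()
    for hashes in group:
        profile.update(set(hashes))  # set(): count each hash once per string
    return profile

def group_similarity(group_a, group_b):
    """Weighted Jaccard similarity of two groups, as a percentage (0 to 100)."""
    pa, pb = group_profile(group_a), group_profile(group_b)
    keys = set(pa) | set(pb)
    if not keys:
        return 0.0  # convention chosen here for two empty groups
    overlap = sum(min(pa[k], pb[k]) for k in keys)  # Counter returns 0 if absent
    total = sum(max(pa[k], pb[k]) for k in keys)
    return 100.0 * overlap / total

# Groups built from the example strings in the posting (grouping is illustrative).
group_a = [
    [-1787886117, -1099549855, 1918157825],  # Two Bike Shops
    [-1099549855, 1918157825],               # Bike Shops
]
group_b = [
    [733576683, -1099549855, -1201797668],   # Every Bike Shop
    [733576683, -1201797668],                # Every Shop
    [733576683, 1918157825],                 # Every Shops
]

print(f"{group_similarity(group_a, group_b):.1f}% similar")
```

Each group is reduced to a single frequency table in one pass, so the comparison cost depends on the number of distinct hashes rather than on the number of string pairs.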

3. How to find relationships between groups based on strings having hash values. The goal is to find whether Group A points to Group B in the majority of cases, or whether Group B points to Group A, in order to identify relationships. For example: Group A is the parent and Group B is the child, where the children point to the parent.
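
Unlike similarity, direction needs an asymmetric measure. A possible sketch, assuming the usual foreign-key intuition that nearly every distinct child value also occurs in the parent: compute containment C(X in Y) = |X ∩ Y| / |X| in both directions and compare. The function names and the 0.9 threshold are assumptions, not values taken from the posting.

```python
def containment(child, parent):
    """Fraction of distinct child strings that also occur in the parent.

    Each string is identified by the frozenset of its word hashes,
    so word order inside a string does not matter.
    """
    child_keys = {frozenset(h) for h in child}
    parent_keys = {frozenset(h) for h in parent}
    if not child_keys:
        return 0.0
    return len(child_keys & parent_keys) / len(child_keys)

def direction(group_a, group_b, threshold=0.9):
    """Guess which group points to which; threshold is an assumed cut-off."""
    a_in_b = containment(group_a, group_b)
    b_in_a = containment(group_b, group_a)
    if b_in_a >= threshold and b_in_a > a_in_b:
        return "B points to A"  # A behaves like the parent
    if a_in_b >= threshold and a_in_b > b_in_a:
        return "A points to B"  # B behaves like the parent
    return "undetermined"

# Illustrative groups: A holds three distinct strings, B repeats two of them.
group_a = [
    [-1099549855, 1918157825],   # Bike Shops
    [733576683, -1201797668],    # Every Shop
    [733576683, 1918157825],     # Every Shops
]
group_b = [
    [1918157825, -1099549855],   # Shops Bike (same words, different order)
    [-1201797668, 733576683],    # Shop Every
    [-1099549855, 1918157825],   # Bike Shops again
]

print(direction(group_a, group_b))
```

Here every distinct string in Group B occurs in Group A (containment 1.0) while only two of the three strings in Group A occur in Group B, so the sketch reports that B points to A.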

Please note: I need a statistical proof in theory, with a good written example, showing that this works on millions of records in the quickest possible time using sampled data.
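
As an illustration of the kind of sampling argument that could back such a proof, Hoeffding's inequality bounds the error of a sampled proportion: for n independent draws, P(|p̂ − p| ≥ ε) ≤ 2·exp(−2nε²), so n ≥ ln(2/δ) / (2ε²) draws suffice for an error below ε with probability at least 1 − δ, regardless of how many records the table holds. The sketch below applies this to a sampled match rate between a child column and a parent column; the function names and default tolerances are assumptions.

```python
import math
import random

def hoeffding_sample_size(epsilon, delta):
    """Smallest n with 2 * exp(-2 * n * epsilon**2) <= delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))

def estimate_containment(child_values, parent_values,
                         epsilon=0.05, delta=0.01, seed=0):
    """Estimate the share of child rows whose value exists in the parent.

    Rows are drawn uniformly with replacement, which matches the
    independence assumption behind the Hoeffding bound.
    """
    n = hoeffding_sample_size(epsilon, delta)
    rng = random.Random(seed)
    parent_set = set(parent_values)  # one full pass over the parent column
    sample = [rng.choice(child_values) for _ in range(n)]
    hits = sum(1 for value in sample if value in parent_set)
    return hits / n, n

print(hoeffding_sample_size(0.05, 0.01))  # draws needed for ±0.05 at 99%

estimate, n = estimate_containment([1, 2, 3, 2, 1], [1, 2, 3, 4])
print(f"estimated match rate {estimate:.2f} from {n} draws")
```

The required sample size depends only on ε and δ, which is what makes the cost independent of the record count; the parent lookup set still has to be built once (or replaced by an index lookup in the database).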

------------------------------------------
Added 8 MAY 2010, 5:59 AM EDT