The objective of this project is to build, using AWS Big Data technology, the infrastructure on AWS needed to process and relate a series of datasets. Once the data has been processed and related, it should be possible to run different types of queries on it. A relational database has been designed for this purpose; its largest tables hold 4,000,000,000 and 300,00.00 entries.
A series of datasets has to be downloaded periodically and processed to build an up-to-date relational database. Some datasets will be provided by me and do not have to be downloaded from the internet. The largest datasets can be up to 20 GB in compressed form. It should be possible to run queries both on the downloaded datasets and on the relational database built from them.
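To illustrate the periodic refresh, here is a minimal Python sketch, not a prescribed design. It assumes each dataset is either downloaded from a URL or provided by hand, and that the compressed files are gzip archives (the actual compression format is not stated here). The names `DatasetSource`, `due_for_refresh` and `decompress_streaming` are hypothetical. Decompression is done in chunks so that a 20 GB file never has to fit in memory.

```python
import gzip
import shutil
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class DatasetSource:
    """Hypothetical manifest entry for one dataset."""
    name: str
    url: Optional[str]                    # None: the file is provided by hand
    refresh_every: timedelta              # how often to download it again
    last_fetched: Optional[datetime] = None


def due_for_refresh(ds: DatasetSource, now: datetime) -> bool:
    """Return True if the dataset should be downloaded again."""
    if ds.url is None:
        # Provided datasets are never downloaded from the internet.
        return False
    if ds.last_fetched is None:
        # Never fetched before, so fetch it now.
        return True
    return now - ds.last_fetched >= ds.refresh_every


def decompress_streaming(src_path: str, dst_path: str,
                         chunk_size: int = 1024 * 1024) -> None:
    """Decompress a gzip file chunk by chunk (gzip is an assumption)."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)
```

A scheduled job could call `due_for_refresh` for every manifest entry and download only the datasets that are due, leaving the hand-provided ones untouched.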
All the details about the datasets are already written up in a document, including where to download each dataset, which information from each one must be stored in the relational database, and in which table. The document also defines the relationships between the tables and a series of example queries that the database must support.
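As a rough illustration of what "related tables plus example queries" means in practice, the sketch below uses Python's built-in `sqlite3` module as a small local stand-in for whichever AWS database service is eventually chosen. The table names (`entity`, `record`), their columns, and the sample rows are purely hypothetical; the real schema and queries are the ones defined in the document.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with two related, hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE entity (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL
        );
        CREATE TABLE record (
            id        INTEGER PRIMARY KEY,
            entity_id INTEGER NOT NULL REFERENCES entity(id),
            value     REAL NOT NULL
        );
    """)
    # Sample rows, invented only to make the query below return something.
    conn.executemany("INSERT INTO entity VALUES (?, ?)",
                     [(1, "a"), (2, "b")])
    conn.executemany("INSERT INTO record VALUES (?, ?, ?)",
                     [(1, 1, 1.5), (2, 1, 2.5), (3, 2, 4.0)])
    conn.commit()
    return conn


def totals_per_entity(conn: sqlite3.Connection) -> list:
    """Example query: join the two tables and sum the values per entity."""
    return conn.execute("""
        SELECT e.name, SUM(r.value)
        FROM entity AS e
        JOIN record AS r ON r.entity_id = e.id
        GROUP BY e.name
        ORDER BY e.name
    """).fetchall()
```

The same pattern (foreign keys between tables, then joins and aggregations over them) carries over to a managed relational engine, although tables with billions of rows would also need indexing and partitioning decisions that this toy example does not cover.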