I need some Python code that will join files. The files I will be joining are large - up to 5 GB. I need code that is fast and efficient.
I can't use the Python csv module for this since I may need this code to read from HDFS in a Hadoop cluster.
I have 3 files: A, B, and C. A is the master file, and I need to do something similar to a SQL LEFT OUTER JOIN from A to files B and C.
File A is the master file and has 6 fields:
File B has 3 fields:
File C has 4 fields:
File A joins to file B on column A
File A joins to file C on column A
The final output file will look like
The code has to be fast since the file sizes can be up to 5 GB.
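
To make the requirement concrete, here is a minimal sketch of the kind of join I mean, not a finished solution. It rests on several assumptions that are not settled above: the files are comma-delimited with no quoted fields, "column A" is the first field of every file, keys in B and C are unique, and B and C are small enough to hold in memory as dictionaries while A is streamed. The function names `build_lookup` and `left_outer_join` are made up for illustration. Both functions take any iterable of lines, so the input could be a local file object or a stream coming from HDFS.

```python
def build_lookup(lines, n_fields, delimiter=","):
    """Map the join key (assumed to be the first field) to the remaining fields."""
    lookup = {}
    for line in lines:
        fields = line.rstrip("\r\n").split(delimiter)
        if len(fields) != n_fields:
            continue  # assumption: silently skip malformed rows
        lookup[fields[0]] = fields[1:]  # assumption: keys are unique; last one wins
    return lookup


def left_outer_join(a_lines, b_lookup, c_lookup, b_width=2, c_width=3, delimiter=","):
    """Stream file A and yield one joined line per row of A.

    Rows of A with no match in B or C are padded with empty fields,
    which mirrors the NULLs of a SQL LEFT OUTER JOIN.
    """
    b_empty = [""] * b_width  # B has 3 fields, so 2 non-key fields
    c_empty = [""] * c_width  # C has 4 fields, so 3 non-key fields
    for line in a_lines:
        fields = line.rstrip("\r\n").split(delimiter)
        key = fields[0]
        row = fields + b_lookup.get(key, b_empty) + c_lookup.get(key, c_empty)
        yield delimiter.join(row)
```

With file objects this would be used roughly as `b = build_lookup(open("B"), 3)`, `c = build_lookup(open("C"), 4)`, then writing each line from `left_outer_join(open("A"), b, c)` to the output. Only A is read row by row, so memory use depends on the size of B and C; if those are also several GB, a sort-merge join over pre-sorted files would be the alternative to look at.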