I have been working as an AI developer for the past three years, mainly in the field of NLP. The transformer architecture, which BERT is built on, is based on the attention mechanism and consists of encoder and decoder stacks. Note that BERT itself uses only the encoder stack and is pretrained with masked language modeling; the generative chatbot described here is really a sequence-to-sequence transformer. During training, the question is fed to the encoder, and the encoder's output states, together with the target answer, are fed to the decoder. At inference time the decoder generates the answer autoregressively, predicting one token at a time (this is not CBOW, which is a word2vec training objective). The quality of the output depends heavily on how much training data you have: with a small dataset the model will not generate good answers. It is also computationally expensive to train. For your project I would suggest implementing a retrieval-based chatbot instead, which will be more reliable and take a lot less effort.
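To illustrate the retrieval approach I am recommending, here is a minimal sketch in pure Python: it matches the user's query against a fixed set of question-answer pairs using bag-of-words cosine similarity and returns the best-matching stored answer. The FAQ pairs, the tokenizer, and the similarity threshold are all placeholder assumptions; a real system would use better tokenization (or embeddings) and your own data.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive lowercase whitespace tokenizer; real systems use proper tokenization.
    return text.lower().split()

def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    common = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical FAQ pairs; replace with your own question-answer data.
faq = [
    ("what are your opening hours", "We are open 9am-5pm, Monday to Friday."),
    ("how do i reset my password", "Click 'Forgot password' on the login page."),
    ("where are you located", "Our office is at 123 Example Street."),
]
index = [(Counter(tokenize(q)), a) for q, a in faq]

def answer(query, threshold=0.2):
    # Return the stored answer whose question is most similar to the query,
    # or a fallback when nothing scores above the threshold.
    qvec = Counter(tokenize(query))
    score, best = max((cosine(qvec, vec), ans) for vec, ans in index)
    return best if score >= threshold else "Sorry, I don't know."

print(answer("how can i reset my password"))
# → Click 'Forgot password' on the login page.
```

Because the system can only return answers you wrote yourself, it never hallucinates, needs no GPU training, and works with a small dataset, which is exactly why I suggest it over a generative model here.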