[D] How exactly did Deepseek R1 achieve massive training cost reductions, most posts I read are about its performance, RL, chain of thought, etc, but it’s not clear how the cost of training of the model was brought down so drastically