Reasoning or Simply Next Token Prediction? Stress-Testing Large LLMs