As someone technical, I have a general understanding of how LLMs work.

But there are some small things that puzzle me:

1. LLMs are next-token predictors, right? So how do we know when they are done answering? We could keep sampling tokens forever.

2. How come LLMs actually answer questions instead of just continuing the prompt, as if they were predicting what the continuation should be? For example, if I prompt "What is the capital of ", why would it say "What country do you want to know the capital of?" and not "the United States?"

3. Even if the "harness" is padding the text with markers so the model can tell what the user said apart from what it should say, that makes for a very bizarre overall text. Do companies transform the training data so that it fits that format, or does it magically happen?
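To make question 1 concrete, here's my naive mental model of the sampling loop. Everything here (the `EOS` token, the function names) is made up for illustration, not any real inference API:

```python
# My naive mental model of the generation loop. All names here are
# invented for illustration; real inference stacks are more involved.
EOS = "<eos>"  # hypothetical end-of-sequence token the model can emit

def generate(next_token_fn, prompt_tokens, max_tokens=100):
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        tok = next_token_fn(tokens)  # "the model" picks the next token
        if tok == EOS:               # my guess: a special token signals "done"
            break
        tokens.append(tok)
    return tokens

# Stand-in "model" that replays a canned answer, then emits EOS.
def canned_model(answer):
    it = iter(answer)
    return lambda tokens: next(it, EOS)

out = generate(canned_model(["Paris", "."]),
               ["What", "is", "the", "capital", "of", "France", "?"])
print(" ".join(out))  # stops after "Paris ." instead of running forever
```

Is a special stop token like that really all there is to "knowing it's done"?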

I guess I am trying to separate the "magical" part, which is definitely there, from the training/harness part.
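To pin down what I mean by the harness part: I imagine it serializing the conversation into something like the transcript below before handing it to the model. The `<|...|>` markers are invented for illustration; I assume real chat templates differ per model:

```python
# A guess at how a harness might serialize a chat into one long
# string. The <|...|> markers are invented for illustration.
def to_transcript(messages):
    parts = []
    for role, text in messages:
        parts.append(f"<|{role}|>\n{text}\n<|end|>")
    parts.append("<|assistant|>\n")  # trailing cue: now the model "answers"
    return "\n".join(parts)

print(to_transcript([
    ("system", "You are a helpful assistant."),
    ("user", "What is the capital of France?"),
]))
```

If models are fine-tuned on lots of text in exactly this shape, I can see why they would "answer" rather than continue the prompt, but that's the part I'd like confirmed.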

Any understandable resource on this?