I have been doing some work on a CommonMark parser lately, and ran into a problem. When I am converting from raw CommonMark input to anything else, there’s an intermediate step which I think of as “the AST step”: the input has been parsed into an object tree, but not yet rendered to another language like HTML. I wrote my parser so that the tree at this step preserves whitespace, newlines, etc., so you could convert it back into exactly the same raw CommonMark input.
Comparing my work against the “AST” tab in the dingus, I now see that my parser’s output doesn’t match the official AST in a lot of places. This is because the CommonMark spec calls for changes to the input (for example, the raw contents of an ATX heading are stripped of leading and trailing whitespace before being parsed as inline content), which makes it impossible to recover the original CommonMark input once parsing is complete.
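To make the ATX heading case concrete, here is a minimal Python sketch of what I mean. It is not taken from any real parser: the `Heading` class, the `parse_heading` function, and the simplified regex are all hypothetical, and closing `#` sequences are deliberately ignored. The `content` field holds the stripped text the spec describes, while `raw` is my own addition that keeps the original line so it can be written back out.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Simplified ATX heading pattern (an assumption for illustration only):
# up to 3 spaces of indentation, 1-6 '#', then either whitespace plus text
# or nothing but whitespace. Closing '#' sequences are not handled.
ATX = re.compile(r"^( {0,3})(#{1,6})([ \t]+.*|[ \t]*)$")


@dataclass
class Heading:
    level: int
    content: str  # stripped text, as the spec describes
    raw: str      # hypothetical extra field: the untouched source line


def parse_heading(line: str) -> Optional[Heading]:
    """Return a Heading for an ATX heading line, or None if it is not one."""
    m = ATX.match(line)
    if m is None:
        return None
    return Heading(
        level=len(m.group(2)),
        content=m.group(3).strip(" \t"),
        raw=line,
    )


a = parse_heading("## Title")
b = parse_heading("  ##   Title   ")

# Both inputs produce the same level and content, so the original spacing
# cannot be recovered from those two fields alone.
print(a.level == b.level and a.content == b.content)

# Only the extra 'raw' field still distinguishes the two inputs.
print(a.raw != b.raw)
```

With only `level` and `content`, the two inputs are indistinguishable, which is the data loss I am asking about. Keeping something like `raw` (or source offsets) is how my parser currently avoids it.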
Can anyone help me figure out whether parsing CommonMark truly requires losing data from the original input? Am I missing something? (I would think of anything that irreversibly loses data from the original input as a step to be performed only when rendering the AST to another language, but the spec doesn’t seem to support that idea.)