D7z — Menu V2 Link ((hot))
The core innovation of V2 lies in the decoding phase. Let $X$ be the image embedding and $Y = y_1, y_2, ..., y_T$ be the target token sequence (JSON string).
Current solutions rely heavily on two paradigms: d7z menu v2 link