Match RTP packets with decoded image frames

I am currently using GStreamer to stream video over the network. In my application, I need to keep some per-frame data perfectly in sync with the video images (zero offset between the data and the corresponding images).
Since the data is very small (8 bytes per frame), I add it to the padding of the first RTP packet of each encoded frame, just before sending it over UDP. On the receiver side, after receiving the packet, I extract the data from the RTP padding and push it onto a queue. Once a frame is decoded, I pop the first element from the queue, which is supposed to match the decoded frame.
In theory this should work fine. In practice, however, the sink can drop some frames, which results in a mismatch between the frames and the corresponding data in the queue.
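
To make the scheme and its failure mode concrete, here is a minimal sketch in plain Python. It is not my actual GStreamer code: the helper names (`build_packet`, `extract_side_data`, `simulate`), the payload type 96, the SSRC and the timestamp step are all made up for illustration. Only the padding layout (P bit set, last padding octet holding the padding length) follows RFC 3550.

```python
import struct
from collections import deque

# Fixed 12-byte RTP header: V/P/X/CC, M/PT, sequence, timestamp, SSRC
RTP_HEADER = struct.Struct("!BBHII")
PADDING_FLAG = 0x20  # P bit in the first header byte (RFC 3550)


def build_packet(seq, timestamp, payload, side_data):
    """Build an RTP packet whose padding carries 8 bytes of side data."""
    assert len(side_data) == 8
    first_byte = (2 << 6) | PADDING_FLAG  # version 2, padding bit set
    # Payload type 96 and SSRC 0x1234 are arbitrary illustrative values.
    header = RTP_HEADER.pack(first_byte, 96, seq, timestamp, 0x1234)
    # The last padding octet holds the padding length, counting itself.
    padding = side_data + bytes([len(side_data) + 1])
    return header + payload + padding


def extract_side_data(packet):
    """Return the side-data bytes from the padding, or None if no padding."""
    if not packet[0] & PADDING_FLAG:
        return None
    pad_len = packet[-1]
    return packet[-pad_len:-1]


def simulate(frame_count, dropped):
    """Receiver that pops the queue only for frames that are actually shown."""
    queue = deque()
    pairs = []
    for n in range(frame_count):
        packet = build_packet(n, n * 3000, b"encoded-frame", struct.pack("!Q", n))
        queue.append(struct.unpack("!Q", extract_side_data(packet))[0])
        if n in dropped:
            continue  # frame dropped after its data was queued: nothing is popped
        pairs.append((n, queue.popleft()))
    return pairs


pairs = simulate(frame_count=5, dropped={2})
print(pairs)  # [(0, 0), (1, 1), (3, 2), (4, 3)]
```

After frame 2 is dropped, every later frame is paired with the data of the frame before it, and the queue never recovers on its own.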

I am looking for a better solution that lets me reliably match the received RTP packets with the corresponding decoded image frames.

I tried adding metadata to the RTP packets on the receiver side and extracting it after decoding, but that didn't work (the metadata is apparently removed).

Looking forward to any advice!