Does it make sense to use the timestamps applied to each individual video frame/RTP buffer at the Tx device to detect delay at the Rx device?

Hello everyone,

I originally asked this as a follow-up to my previous post, but I think
it is better to make it a separate question so that I can explain the
problem I am facing in more detail. Please bear with the long question :)

Here are my pipelines, which transmit live video from an iMX6 device to an
Ubuntu PC over WiFi:

Tx pipeline:
v4l2src  fps-n=30 -> h264encode ->  rtph264pay -> rtpbin ->
udpsink(port=5000) ->
rtpbin.send_rtcp(port=5001) -> rtpbin.recv_rtcp(port=5002)

Rx pipeline:
udpsrc(port=5000) -> caps -> rtpbin -> rtph264depay -> h264parse ->
avdec_h264 -> videosink
rtpbin.recv_rtcp(port=5001) -> rtpbin.send_rtcp(port=5002)

For my application, I intend to detect the delay in receiving frames at
the Rx device. The delay can be induced by a number of factors, including:
- congestion
- packet loss
- noise, etc.
Once the delay is detected, I intend to insert an IMU (inertial measurement
unit) frame (a custom visualization) in between the live video frames. For
example, if every 3rd frame is delayed, the video will look like:
                        V | V | I | V | V | I | V | V | I | V | .....

where V - video frame received and I - IMU frame inserted at Rx device
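
To make the intended behaviour concrete, here is a minimal Python sketch of
the per-slot decision I have in mind. It is only an illustration: the names
fill_slots, arrival_ms and tolerance_ms are hypothetical (they are not
GStreamer API), and it assumes each frame has a fixed display deadline on
the Rx clock, one slot every 1/30 s.

```python
FRAME_INTERVAL_MS = 1000.0 / 30  # 30 fps -> roughly 33.3 ms per display slot


def fill_slots(arrival_ms, tolerance_ms=0.0):
    """Return 'V' or 'I' for each display slot.

    arrival_ms[i] is the arrival time of frame i in milliseconds on the Rx
    clock, measured from the start of playback, or None if it never arrived.
    Slot i is assumed to have its deadline at (i + 1) * FRAME_INTERVAL_MS.
    """
    pattern = []
    for i, arrived in enumerate(arrival_ms):
        deadline = (i + 1) * FRAME_INTERVAL_MS + tolerance_ms
        on_time = arrived is not None and arrived <= deadline
        # Show the video frame if it made its slot, otherwise an IMU frame.
        pattern.append("V" if on_time else "I")
    return pattern


# Frames 2 and 5 (0-based) miss their slots: one is late, one is lost.
arrivals = [20, 50, 120, 125, 150, None]
print(" | ".join(fill_slots(arrivals)))  # V | V | I | V | V | I
```

Setting tolerance_ms to about one frame interval would correspond to the
66 ms target mentioned in my questions further down.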

1. To achieve this, I need to know the timestamp of each video frame sent
from the Tx device, and compare it with the current time at the Rx device
to get the frame delay:

   frame delay = Current time at Rx - Timestamp of frame at Tx
Since I am working at 30 fps, ideally I should receive a video frame at the
Rx device every 33 ms. Given that the link is WiFi and that encoding and
decoding add their own delays, I understand that this 33 ms precision is
difficult to achieve, and that is perfectly fine for me.

2. Since I am using RTP/RTCP, I had a look into WebRTC, but it caters more
towards sending SR/RR (network statistics) for only a fraction of the data
sent from Tx -> Rx. I also tried the UDP source timeout feature, which
detects that no packets have arrived at the source for a predefined time
and issues a signal notifying the timeout. However, this works only if the
Tx device stops completely (pipeline stopped using Ctrl+C). If the packets
are merely delayed, the timeout does not occur, since the kernel buffers
some old data.
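
One variant I am considering does not need a shared clock at all: instead of
the absolute frame delay, compare how much time passed between two frames on
the Rx clock with how much passed according to their RTP timestamps, in the
spirit of the interarrival calculation in RFC 3550. The sketch below is only
my rough idea, not a tested implementation. The function names are
hypothetical, and it assumes the first packet arrived with "normal" delay and
can serve as the baseline.

```python
CLOCK_RATE = 90000  # RTP clock rate in Hz, taken from my caps


def rtp_diff(later, earlier):
    """Signed difference of two 32-bit RTP timestamps, tolerant of wraparound."""
    d = (later - earlier) & 0xFFFFFFFF
    return d - (1 << 32) if d >= (1 << 31) else d


def relative_delays_ms(packets):
    """Extra delay of each packet relative to the first one, in milliseconds.

    packets is a list of (rtp_timestamp, arrival_time_s) tuples, where
    arrival_time_s is read from any monotonic clock on the Rx device.
    """
    first_ts, first_arrival = packets[0]
    delays = []
    for ts, arrival in packets:
        sent_elapsed = rtp_diff(ts, first_ts) / CLOCK_RATE  # per the Tx clock
        recv_elapsed = arrival - first_arrival              # per the Rx clock
        delays.append((recv_elapsed - sent_elapsed) * 1000.0)
    return delays


# Three frames 3000 ticks (1/30 s) apart; the third arrives 50 ms late.
# The timestamps deliberately wrap around the 32-bit boundary.
packets = [
    (4294965000, 10.0),
    (704, 10.0 + 3000 / CLOCK_RATE),
    (3704, 10.0 + 6000 / CLOCK_RATE + 0.050),
]
print([round(d, 3) for d in relative_delays_ms(packets)])
```

This only gives the delay relative to the baseline packet, so a constant
network latency stays invisible. For deciding whether a frame missed its
33 ms slot, that may be enough.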

I have the following questions:

1. Does it make sense to use the timestamps of each video frame/RTP buffer
to detect the delay in receiving frames at the Rx device? What would be a
better design for such a use case? Or is it too much overhead to consider
the timestamp of every frame/buffer, so that I should instead consider only
every Nth one, such as every 5th or every 10th video frame/buffer? In the
worst case, where every alternate frame is delayed, the video would have
the following sequence:
                   V | I | V | I | V | I | V | I | V | I | .....
I understand that per-frame precision in this alternating case can be
difficult to handle, so I am targeting detection and insertion of the IMU
frame at least within 66 ms. Switching between the live video frame and the
inserted frame is also a concern. I use the OpenGL plugins to do the IMU
data manipulation.

2. Which timestamps should I be considering at the Rx device? To calculate
the delay, I need a common reference between the Tx and Rx devices, which I
do not have. I can access the PTS and DTS of the RTP buffers, but since no
common reference was available, I could not use them to detect the delay.
Is there any other way I could do this?

3. My caps have the following parameters (only a few are shown):
application/x-rtp, clock-rate=90000, timestamp-offset=2392035930,
seqnum-offset=23406
Can these be used to calculate a common reference for Tx and Rx? I am not
sure I understand these numbers or how to use them at the Rx device to get
a reference. Any pointers on understanding these parameters?

4. Are there any other possible approaches for such an application? My idea
above may be too impractical, and I am open to suggestions for tackling
this issue.
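
To show what I mean in questions 2 and 3, here is my tentative reading of
those numbers as a Python sketch. Please correct me if it is wrong. I am
assuming that timestamp-offset is the RTP timestamp at the start of the
stream, that RTP timestamps are 32-bit counters ticking at clock-rate, and
that an RTCP sender report (SR) carries an NTP timestamp together with the
RTP timestamp of the same instant, as described in RFC 3550. The function
names are hypothetical, and the SR values in the example are made up.

```python
NTP_UNIX_OFFSET = 2208988800  # seconds from 1900-01-01 (NTP) to 1970-01-01 (Unix)
CLOCK_RATE = 90000            # from my caps


def media_time_s(rtp_ts, timestamp_offset, clock_rate=CLOCK_RATE):
    """Seconds of media elapsed since the (assumed) start of the stream."""
    return ((rtp_ts - timestamp_offset) & 0xFFFFFFFF) / clock_rate


def ntp_to_unix(ntp_sec, ntp_frac):
    """Convert a 64-bit NTP timestamp (32.32 fixed point) to Unix seconds."""
    return (ntp_sec - NTP_UNIX_OFFSET) + ntp_frac / 2**32


def rtp_to_sender_time(rtp_ts, sr_rtp_ts, sr_ntp_sec, sr_ntp_frac,
                       clock_rate=CLOCK_RATE):
    """Map an RTP timestamp to Tx wall-clock time using one SR (NTP, RTP) pair."""
    diff = (rtp_ts - sr_rtp_ts) & 0xFFFFFFFF
    if diff >= 1 << 31:  # the packet is older than the SR
        diff -= 1 << 32
    return ntp_to_unix(sr_ntp_sec, sr_ntp_frac) + diff / clock_rate


offset = 2392035930  # timestamp-offset from my caps
print(media_time_s(offset + 90000, offset))  # 1.0

# Made-up SR: the Tx clock read Unix time 1000.0 when the RTP timestamp
# was equal to `offset`.
sent_at = rtp_to_sender_time(offset + 45000, offset, NTP_UNIX_OFFSET + 1000, 0)
print(sent_at)  # 1000.5
```

If this reading is right, the frame delay would be the Rx wall-clock time
minus sent_at, but only if both devices have their clocks synchronised (for
example via NTP), which is exactly the common reference I am missing.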

Since this is a university project, I have very little support available.
It would be great if someone could point me in some direction, be it a
completely new design or an improvement to the current one.

