Performing intra-frame prediction in Matlab

I am trying to implement a hybrid video coding system like the one used in the H.264 / MPEG-4 AVC video standard, for which I need to perform "Intra-frame Prediction" and "Inter Prediction" (i.e., motion estimation) on a set of 30 frames for video processing in Matlab. I am working with the Mother-Daughter test sequence.

Note that this post is very similar to my previously asked question, but here I am asking purely about the Matlab implementation.

Edit: I am trying to implement the structure shown below:

[Image: block diagram of the hybrid video coding system]

My question is: how do I carry out the horizontal prediction mode, which is one of the nine modes of the intra coding system? How are the pixels selected?


What confuses me is that intra prediction needs two inputs: 8x8 blocks of the input frame and 8x8 blocks of the reconstructed frame. But what happens when you encode the very first block of the input frame, since there are no reconstructed pixels yet to perform horizontal prediction with?

In the image above, the whole system is a closed loop, so where do you start?

End of edit.

Question 1: Is the intra-predicted picture computed only for the first picture (the I-frame) of the sequence, or does it need to be computed for all 30 frames?

I know there are five intra coding modes: horizontal, vertical, DC (constant), diagonal down-left and diagonal down-right.

Question 2: How do I actually compare the reconstructed frame and the anchor frame (the original current frame)?

Question 3: Why is a search area needed? Is it possible to use individual 8x8 blocks as the search area, moving one pixel at a time?

I know the pixels from the reconstructed block are used for comparison, but is the comparison carried out one pixel at a time within the search area? Wouldn't that take too long if 30 frames need to be processed?


2 answers


Continuing from your previous post, let me address each of your questions one at a time.


Question number 1

Usually you use one I-frame and refer to it as a keyframe. Using this as your reference frame, for each 8 x 8 block in it you look at the next frame and figure out where that 8 x 8 block has most likely moved to. You describe this displacement with a motion vector, and you construct a P-frame that consists of this information. It tells you where each 8 x 8 block from the keyframe has moved to in that frame.
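
As a rough illustration of how such a motion vector is then used, here is a minimal Matlab sketch of motion compensation for a single 8x8 block. The variable names (refFrame, curFrame, mv, r, c) are my own assumptions, not anything from your code:

    % Motion compensation for one 8x8 block (sketch, names assumed).
    % refFrame : the reference (key) frame
    % curFrame : the current frame being predicted
    % mv       : motion vector [dy, dx] found by motion estimation
    % (r, c)   : top-left corner of the block in the current frame
    blockSize = 8;
    srcR = r + mv(1);
    srcC = c + mv(2);

    % The prediction is simply the displaced block copied from the reference frame
    predBlock = refFrame(srcR:srcR+blockSize-1, srcC:srcC+blockSize-1);

    % The encoder then only has to code the residual for this block
    residual = double(curFrame(r:r+blockSize-1, c:c+blockSize-1)) - double(predBlock);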

Now, the next question you may ask is: how many frames go by before we decide to use another keyframe? That is completely up to you, and you configure it in your encoder settings. For digital broadcast and DVD storage, it is recommended to create an I-frame every 0.5 seconds or so. Assuming 24 frames per second, this means you need to create an I-frame every 12 frames. This Wikipedia article is where I got that information.

As for the intra-coding modes, they tell the encoder in which direction to look when trying to find the best matching prediction for a block. Take a look at this article, which discusses the different prediction modes; Figure 1 gives a very good summary of them. In fact, there are nine altogether. Also take a look at this Wikipedia article for more vivid pictures of the various prediction mechanisms. To get maximum accuracy, encoders also perform sub-pixel estimation at 1/4-pixel precision by doing bilinear interpolation between pixels.
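
Since your question is specifically about the horizontal mode: below is a minimal Matlab sketch of horizontal intra prediction for a single 8x8 block. The variable names (origFrame, recon, r, c) are my own assumptions, and the constant fallback of 128 for the leftmost block column (where no reconstructed neighbours exist yet) is one simple way of handling the "no recovered pixels" case you asked about:

    % Horizontal intra prediction for one 8x8 block (sketch, names assumed).
    blockSize = 8;
    block = double(origFrame(r:r+blockSize-1, c:c+blockSize-1));

    if c > 1
        % Horizontal mode: each row is predicted from the reconstructed pixel
        % immediately to its left, replicated across the whole row.
        leftCol = double(recon(r:r+blockSize-1, c-1));
        pred = repmat(leftCol, 1, blockSize);
    else
        % Leftmost block column: no reconstructed neighbours are available yet,
        % so fall back to a constant (DC-like) prediction of 128 for 8-bit video.
        pred = 128 * ones(blockSize);
    end

    residual = block - pred;   % this residual is what gets transformed and quantised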

I'm not sure if you only want motion compensation with P-frames or if you want B-frames as well. I'm going to assume that you will need both. So take a look at this diagram that I pulled from Wikipedia:

[Image: I/B/P frame sequence diagram with a time axis along the bottom. Source: Wikipedia]

This is a very common sequence for encoding frames in your video. It follows the format:

IBBPBBPBBI...

      

At the bottom there is a time axis that tells you the order in which frames are sent to the decoder once they are encoded. First you need to encode the I-frames, then the P-frames, and then the B-frames. A typical sequence of frames encoded between I-frames follows the format you see in the figure. The set of frames between I-frames is called a Group of Pictures (GOP). If you recall from our previous post, B-frames use information both before and after their current position. So, to summarize the timeline, this is what is usually done on the encoder side:

  • The I-frame is encoded and then used to predict the first P-frame
  • The first I-frame and the first P-frame are then used to predict the first and second B-frames that lie between them
  • The second P-frame is predicted from the first P-frame, and the third and fourth B-frames are created from the information between the first and second P-frames
  • Finally, the last frame in the GOP is an I-frame. It is encoded, and then the information between the second P-frame and this second I-frame (the last frame) is used to generate the fifth and sixth B-frames
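
To make the IBBPBBPBB... pattern above concrete for a 30-frame sequence, here is a small Matlab sketch that simply labels each frame with a type; the GOP length of 9 and the variable names are assumptions you would tune yourself:

    % Assign a frame type (I, P or B) to each of 30 frames following an
    % IBBPBBPBB... pattern with a GOP length of 9 (sketch, values assumed).
    numFrames = 30;
    gopLen    = 9;
    frameType = repmat('B', 1, numFrames);

    for k = 1:numFrames
        posInGop = mod(k-1, gopLen);      % position within the current GOP
        if posInGop == 0
            frameType(k) = 'I';           % keyframe at the start of each GOP
        elseif mod(posInGop, 3) == 0
            frameType(k) = 'P';           % every third frame after the I is a P
        end
    end

    disp(frameType)   % prints IBBPBBPBBIBBPBBPBBIBBPBBPBBIBB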

So in terms of transmission, you have to send the I-frames first, then the P-frames and then the B-frames. The decoder must wait for the P-frames before the B-frames can be reconstructed. This method of decoding is more robust because:

  • This minimizes the problem of possible uncovered areas.
  • P-frames and B-frames require less data than I-frames, so less data is transmitted.


However, B-frames will require more motion vectors, so there will be slightly higher bit rates.

Question number 2

To be honest, what I've seen most is people computing the simple sum of squared differences between one frame and another to compare their similarity. You take the colour components (be it RGB, YUV, etc.) of each pixel from one frame at one position, subtract them from the colour components at the same spatial location in the other frame, square each component and add them all together. You accumulate all of these differences over every location in your frame. The higher the value, the greater the difference between one frame and the next.
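
In Matlab, such a sum-of-squared-differences comparison is only a couple of lines. The sketch below assumes frameA and frameB are two same-sized frames (the variable names are mine):

    % Sum of squared differences (SSD) between two frames (sketch, names assumed).
    diffSq = (double(frameA) - double(frameB)).^2;
    ssd    = sum(diffSq(:));     % larger value => the frames differ more

    % The sum of absolute differences (SAD) is a cheaper variant that is
    % commonly used for block matching:
    sad = sum(abs(double(frameA(:)) - double(frameB(:))));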

Another well-known measure is called Structural Similarity (SSIM), where statistical measurements such as the mean and variance are used to estimate how similar two frames are.
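
If you have the Image Processing Toolbox, Matlab ships an ssim function that computes this score directly (a value of 1 means the frames are identical); a minimal usage sketch with assumed variable names:

    % Structural similarity between a reconstructed frame and the original
    % grayscale frame (requires the Image Processing Toolbox; names assumed).
    score = ssim(uint8(reconFrame), uint8(origFrame));
    fprintf('SSIM = %.4f\n', score);   % 1.0 means the frames are identical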

There are a whole host of other video quality metrics in use, and there are advantages and disadvantages to each of them. Rather than telling you which one to use, I refer you to this Wikipedia article so you can decide for yourself depending on your application. It describes a collection of video similarity and quality metrics, and it doesn't stop there: research is still ongoing into which numerical measures best capture the similarity and quality between two frames.

Question number 3

When looking for the best match for a block from an I-frame that has moved in a P-frame, you need to restrict the search to a finite-size window around the location of that I-frame block, because you don't want the encoder to search every location in the frame. That is simply too computationally intensive and would make your encoder slow. I actually mentioned this in a previous post.
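
As a concrete illustration of searching within a bounded window, here is a minimal full-search block-matching sketch for a single 8x8 block in Matlab; the ±7-pixel search range and all variable names are assumptions:

    % Full-search block matching for one 8x8 block (sketch, names assumed).
    % refFrame : previous (reference) frame,  curFrame : current frame
    % (r, c)   : top-left corner of the block in curFrame
    blockSize   = 8;
    searchRange = 7;                     % search +/- 7 pixels around (r, c)
    curBlock = double(curFrame(r:r+blockSize-1, c:c+blockSize-1));

    bestSAD = inf;
    bestMV  = [0 0];
    [H, W]  = size(refFrame);

    for dy = -searchRange:searchRange
        for dx = -searchRange:searchRange
            rr = r + dy;  cc = c + dx;
            % Skip candidate positions that fall outside the reference frame
            if rr < 1 || cc < 1 || rr+blockSize-1 > H || cc+blockSize-1 > W
                continue;
            end
            cand = double(refFrame(rr:rr+blockSize-1, cc:cc+blockSize-1));
            sad  = sum(abs(curBlock(:) - cand(:)));   % sum of absolute differences
            if sad < bestSAD
                bestSAD = sad;
                bestMV  = [dy dx];       % best motion vector found so far
            end
        end
    end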

Using a single pixel to find its match in the next frame is a very bad idea because of how little information a single pixel carries. The reason you compare blocks when you perform motion estimation is that blocks of pixels typically contain a lot of variation that is unique to the block itself. If we can find that same variation in a different area of your next frame, then it is a very good candidate for where that group of pixels moved to. Remember, we are assuming that the frame rate of the video is high enough that most of the pixels in your frame either don't move at all or move very slowly. Using blocks makes the matching more accurate.

Blocks are compared as a whole, and the candidate blocks are scored using one of those video similarity measures I talked about in the Wikipedia article I linked to. You are of course correct that doing this for 30 frames will indeed be slow, but there are existing implementations that are highly optimized for fast encoding. FFMPEG is a good example. In fact, I use FFMPEG all the time. FFMPEG is highly customizable and you can build an encoder / decoder that takes advantage of your system architecture. I have set it up so that encoding / decoding uses all the cores on my machine (8 in total).

That doesn't get away from the actual block comparison itself, though. In fact, the H.264 standard has a bunch of prediction mechanisms so that you don't have to compare every block in an I-frame to predict the next P-frame (or one P-frame to the next P-frame, and so on). This relates to the different prediction modes in the Wikipedia article and in the paper I linked to. The encoder is smart enough to detect a pattern and then generalize over an area of your image where it believes the same amount of motion will occur. It skips that area and moves on to the next.


This task is (in my opinion) too broad to tackle from scratch. There are so many subtleties to motion prediction / compensation that there is a reason most video engineers use the tools already available to do the job for us. Why reinvent the wheel when it has already been invented, right?

I hope this answers your questions. I believe I have given you more questions than I actually answered, but I hope this is enough for you to dig deeper into this topic and reach your overall goal.

Good luck!


Question 1: Is the intra-predicted picture computed only for the first picture (the I-frame) of the sequence, or does it need to be computed for all 30 frames?

I know there are five intra coding modes: horizontal, vertical, DC (constant), diagonal down-left and diagonal down-right.

Answer: Intra prediction does not need to be applied to all frames.



Question 2: How do I actually compare the reconstructed frame and the anchor frame (the original current frame)?

Question 3: Why is a search area needed? Is it possible to use individual 8x8 blocks as the search area, moving one pixel at a time?

Answer: you need to use a block-matching algorithm to find the motion vector, and that is why a search area is required. The search area should usually be larger than the block size: a larger search area means more computation, but also a better chance of finding an accurate match. For example, with an 8x8 block and a search range of ±7 pixels, there are 15 × 15 = 225 candidate positions to evaluate for every block.







