All,
Sorry for casting such a broad net with this one.  I'm sure most people who reply will get at least one mailing list rejection.  However, this is an issue that affects a LOT of components, and that's why it's thorny to begin with.  Please pardon the length of this e-mail as well; I promise there's a concrete point/proposal at the end.

Explicit synchronization is the future of graphics and media.  At least, that seems to be the consensus among all the graphics people I've talked to.  I had a chat with one of the lead Android graphics engineers recently who told me that doing explicit sync from the start was one of the best engineering decisions Android ever made.  It's also the direction being taken by more modern APIs such as Vulkan.

## What are implicit and explicit synchronization?

For those who aren't familiar with this space: GPUs, media encoders, etc. are massively parallel, and synchronization of some form is required to ensure that everything happens in the right order and to avoid data races.  Implicit synchronization is when bits of work (3D, compute, video encode, etc.) are implicitly ordered based on the absolute CPU-time order in which the API calls occur.  Explicit synchronization is when the client (whatever that means in any given context) provides the dependency graph explicitly via some sort of synchronization primitives.  If you're still confused, consider the following examples:

With OpenGL and EGL, almost everything is implicit sync.  Say you have two OpenGL contexts sharing an image, where one writes to it and the other textures from it.  The way the OpenGL spec works, the client has to make the API calls to render to the image before (in CPU time) it makes the API calls which texture from the image.  As long as it does this (and maybe inserts a glFlush?), the driver will ensure that the rendering completes before the texturing happens and you get correct contents.

Implicit synchronization can also happen across processes.
Wayland, for instance, is currently built on implicit sync, where the client does its rendering and then does a hand-off (via wl_surface::commit) to tell the compositor it's done, at which point the compositor can texture from the surface.  The hand-off ensures that the client's OpenGL API calls happen before the server's OpenGL API calls.

A good example of explicit synchronization is the Vulkan API.  There, a client (or multiple clients) can simultaneously build command buffers in different threads, where one of those command buffers renders to an image and the other textures from it, and then submit both of them at the same time with instructions to the driver for which order to execute them in.  The execution order is described via the VkSemaphore primitive.  With the new VK_KHR_timeline_semaphore extension, you can even submit the work which does the texturing BEFORE the work which does the rendering and the driver will sort it out.

The #1 problem with implicit synchronization (which explicit solves) is that it leads to a lot of over-synchronization, both in client space and in driver/device space.  The client has to synchronize a lot more because it has to ensure that the API calls happen in a particular order.  The driver/device have to synchronize a lot more because they never know what is going to end up being a synchronization point, as an API call on another thread/process may occur at any time.  As we move to more and more multi-threaded programming, this synchronization (on the client side especially) becomes more and more painful.

## Current status in Linux

Implicit synchronization in Linux works via the kernel's internal dma_buf and dma_fence data structures.  A dma_fence is a tiny object which represents the "done" status for some bit of work.  Typically, dma_fences are created as a by-product of someone submitting some bit of work (say, 3D rendering) to the kernel.
The dma_buf object has a set of dma_fences on it representing shared (read) and exclusive (write) access to the object.  When work is submitted which, for instance, renders to the dma_buf, it's queued waiting on all the fences on the dma_buf, and a dma_fence is created representing the end of said rendering work and installed as the dma_buf's exclusive fence.  This way, the kernel can manage all its internal queues (3D rendering, display, video encode, etc.) and know which things to submit in what order.

For the last few years, we've had sync_file in the kernel and it's plumbed into some drivers.  A sync_file is just a wrapper around a single dma_fence.  A sync_file is typically created as a by-product of submitting work (3D, compute, etc.) to the kernel and is signaled when that work completes.  When a sync_file is created, it is guaranteed by the kernel that it will become signaled in finite time and, once it's signaled, it remains signaled for the rest of time.  A sync_file is represented in UAPIs as a file descriptor and can be used with normal file APIs such as dup().  It can be passed into another UAPI which does some bit of queued work, and the submitted work will wait for the sync_file to be triggered before executing.  A sync_file also supports poll() if you want to wait on it manually.

Unfortunately, sync_file is not broadly used and not all kernel GPU drivers support it.  Here's a very quick overview of my understanding of the status of various components (I don't know the status of anything in the media world):

- Vulkan: Explicit synchronization all the way, but we have to go implicit as soon as we interact with a window system.  Vulkan has APIs to import/export sync_files to/from its VkSemaphore and VkFence synchronization primitives.
- OpenGL: Implicit all the way.  There are some EGL extensions to enable some forms of explicit sync via sync_file, but OpenGL itself is still implicit.
- Wayland: Currently depends on implicit sync in the kernel (accessed via EGL/OpenGL).  There is an unstable extension to allow passing sync_files around, but it's questionable how useful it is right now (more on that later).
- X11: With Present, it has these "explicit" fence objects, but they're always a shmfence which lets the X server and client do a userspace CPU-side hand-off without going over the socket (and round-tripping through the kernel).  However, the only thing that fence does is order the OpenGL API calls in the client and server; the real synchronization is still implicit.
- linux/i915/gem: Fully supports using sync_file or syncobj for explicit sync.
- linux/amdgpu: Supports sync_file and syncobj, but it still implicitly syncs sometimes due to its internal memory residency handling, which can lead to over-synchronization.
- KMS: Implicit sync all the way.  There are no KMS APIs which take explicit sync primitives.
- v4l: ???
- gstreamer: ???
- Media APIs such as vaapi etc.: ???

## Chicken and egg problems

Ok, this is where it starts getting depressing.  I made the claim above that Wayland has an explicit synchronization protocol that's of questionable usefulness.  I would claim that basically any bit of plumbing we do through window systems is currently of questionable usefulness.  Why?

From my perspective, as a Vulkan driver developer, I have to deal with the fact that Vulkan is an explicit sync API but Wayland and X11 aren't.  Unfortunately, the Wayland extension solves zero problems for me because I can't really use it unless it's implemented in all of the compositors.  Until every Wayland compositor I care about my users being able to use (which is basically all of them) supports the extension, I have to continue to carry around my pile of hacks to keep implicit sync and Vulkan working nicely together.

From the perspective of a Wayland compositor (I used to play in this space), they'd love to implement the new explicit sync extension but can't.
Sure, they could wire up the extension, but the moment they go to flip a client buffer to the screen directly, they discover that KMS doesn't support any explicit sync APIs.  So, yes, they can technically implement the extension assuming the EGL stack they're running on has the sync_file extensions, but any client buffers which come in using the explicit sync Wayland extension have to be composited and can't be scanned out directly.  As a 3D driver developer, I absolutely don't want compositors doing that because my users will complain about performance issues due to the extra blit.

Ok, so let's say we get KMS wired up with explicit sync.  That solves all our problems, right?  It does, right up until someone decides that they want to screen capture their Wayland session via some hardware media encoder that doesn't support explicit sync.  Now we have to plumb it all the way through the media stack, gstreamer, etc.  Great, so let's do that!  Oh, but gstreamer won't want to plumb it through until they're guaranteed that they can use explicit sync when displaying on X11 or Wayland.  Are you seeing the problem?

To make matters worse, since most things are doing implicit synchronization today, it's really easy to get your explicit synchronization wrong and never notice.  If you forget to pass a sync_file into one place (say you never notice KMS doesn't support them), it will probably work anyway thanks to all the implicit sync that's going on elsewhere.

So, clearly, we all need to go write piles of code that we can't actually properly test until everyone else has written their piece, and then we use explicit sync if and only if all components support it.  Really?  We're going to do multiple years of development and then just hope it works when we finally flip the switch?  That doesn't sound like a good plan to me.
## A proposal: Implicit and explicit sync together

How to solve all these chicken-and-egg problems is something I've been giving quite a bit of thought (and talking with many others about) over the last couple of years.  One motivation for this is that we have to deal with the mismatch in Vulkan.  Another motivation is that I'm becoming increasingly unhappy with the way that synchronization, memory residency, and command submission are inherently intertwined in i915 and would like to break things apart.  Towards that end, I have an actual proposal.

A couple weeks ago, I sent a series of patches to the dri-devel mailing list which adds a pair of new ioctls to dma-buf which allow userspace to manually import or export a sync_file from a dma-buf.  The idea is that something like a Wayland compositor can switch to 100% explicit sync internally once the ioctl is available.  If it gets buffers in from a client that doesn't use the explicit sync extension, it can pull a sync_file from the dma-buf and use that exactly as it would a sync_file passed via the explicit sync extension.  When it goes to scan out a user buffer and discovers that KMS doesn't accept sync_files (or if it tries to use that pesky media encoder no one has converted), it can take its sync_file for display and stuff it into the dma-buf before handing it to KMS.

Along with the kernel patches, I've also implemented support for this in the Vulkan WSI code used by ANV and RADV.  With those patches, the only requirement on the Vulkan drivers is that you be able to export any VkSemaphore as a sync_file and temporarily import a sync_file into any VkFence or VkSemaphore.  As long as that works, the core Vulkan driver only ever sees explicit synchronization via sync_file.  The WSI code uses these new ioctls to translate the implicit sync of X11 and Wayland to the explicit sync the Vulkan driver wants.
I'm hoping (and here's where I want a sanity check) that a simple API like this will allow us to finally start moving the Linux ecosystem over to explicit synchronization one piece at a time in a way that's actually correct.  (No Wayland explicit sync with compositors hoping KMS magically works even though it doesn't have a sync_file API.)  Once some pieces in the ecosystem start moving, there will be motivation to start moving others, and maybe we can actually build the momentum to get most everything converted.

For reference, you can find the kernel RFC patches and mesa MR here:

https://lists.freedesktop.org/archives/dri-devel/2020-March/258833.html
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4037

At this point, I welcome your thoughts, comments, objections, and maybe even help/review. :-)

--Jason Ekstrand

_______________________________________________
gstreamer-devel mailing list
[hidden email]
https://lists.freedesktop.org/mailman/listinfo/gstreamer-devel
On Wed, Mar 11, 2020 at 12:31 PM Jason Ekstrand <[hidden email]> wrote:
> [snip]
>
> - KMS: Implicit sync all the way.  There are no KMS APIs which take
> explicit sync primitives.

Correction: Apparently, I missed some things.  If you use atomic, KMS
does have explicit in- and out-fences.  Non-atomic users (e.g. X11)
are still in trouble, but most Wayland compositors use atomic these
days.

> From the perspective of a Wayland compositor (I used to play in this
> space), they'd love to implement the new explicit sync extension but
> can't.  Sure, they could wire up the extension, but the moment they go
> to flip a client buffer to the screen directly, they discover that KMS
> doesn't support any explicit sync APIs.

As per the above correction, Wayland compositors aren't nearly as bad
off as I initially thought.  There may still be weird screen capture
cases, but the normal cases of compositing and displaying via
KMS/atomic should be in reasonably good shape.

> [snip]
>
> --Jason Ekstrand
(I know I'm going to be spammed by so many mailing list ...)
On Wednesday, March 11, 2020 at 14:21 -0500, Jason Ekstrand wrote:
> [snip]
>
> > - v4l: ???
> > - gstreamer: ???
> > - Media APIs such as vaapi etc.: ???
GStreamer is a consumer for V4L2, VAAPI and other stuff. Using asynchronous buffer synchronisation is something we do already with GL (even if limited). We place a GLSync object in the pipeline and attach it to the related GstBuffer. We wait on these GLSync as late as possible (or supersede the sync if we queue more work into the same GL context). That requires a special mode of operation of course. We don't usually like making lazy blocking calls implicit, as it tends to cause random issues. If we need to wait, we think it's better to wait in the module that is responsible, so in general we try to negotiate and fall back locally (it's plugin-based, so this can be really messy otherwise). So basically this problem needs to be solved in V4L2, VAAPI and other lower-level APIs first. We need an API that provides us these fences (in or out), and then we can consider using them. For V4L2, there was an attempt, but it was a bit of a misfit. Your proposal could work, it needs to be tested I guess, but it does not solve some of the other issues that were discussed. Notably for camera capture, where the HW timestamp is captured at about the same time the frame is ready. But the timestamp is not part of the payload, so you need an entire API to asynchronously deliver that metadata. It's the biggest pain point I've found; such an API would be quite invasive or, if made really generic, might just never be adopted widely enough. There are other elements that would implement fencing, notably kmssink, but no one actually dared porting it to atomic KMS, so clearly there is very little community interest. glimagesink could clearly benefit. Right now, if we import a DMABuf and that DMABuf is used for render, an implicit fence is attached, which we are unaware of. Philipp Zabel is working on a patch so V4L2 QBUF would wait, but waiting in QBUF is not allowed if O_NONBLOCK was set (which GStreamer uses), so the operation will just fail where it worked before (breaking userspace). 
If it was an explicit fence, we could handle that in GStreamer cleanly as we do for new APIs. > > > > > > ## Chicken and egg problems > > > > Ok, this is where it starts getting depressing. I made the claim > > above that Wayland has an explicit synchronization protocol that's of > > questionable usefulness. I would claim that basically any bit of > > plumbing we do through window systems is currently of questionable > > usefulness. Why? > > > > From my perspective, as a Vulkan driver developer, I have to deal with > > the fact that Vulkan is an explicit sync API but Wayland and X11 > > aren't. Unfortunately, the Wayland extension solves zero problems for > > me because I can't really use it unless it's implemented in all of the > > compositors. Until every Wayland compositor I care about my users > > being able to use (which is basically all of them) supports the > > extension, I have to continue to carry around my pile of hacks to keep > > implicit sync and Vulkan working nicely together. > > > > From the perspective of a Wayland compositor (I used to play in this > > space), they'd love to implement the new explicit sync extension but > > can't. Sure, they could wire up the extension, but the moment they go > > to flip a client buffer to the screen directly, they discover that KMS > > doesn't support any explicit sync APIs. > > As per the above correction, Wayland compositors aren't nearly as bad > off as I initially thought. There may still be weird screen capture > cases but the normal cases of compositing and displaying via > KMS/atomic should be in reasonably good shape. > > > So, yes, they can technically > > implement the extension assuming the EGL stack they're running on has > > the sync_file extensions but any client buffers which come in using > > the explicit sync Wayland extension have to be composited and can't be > > scanned out directly. 
As a 3D driver developer, I absolutely don't > > want compositors doing that because my users will complain about > > performance issues due to the extra blit. > > > > Ok, so let's say we get KMS wired up with explicit sync. That solves > > all our problems, right? It does, right up until someone decides that > > they want to screen capture their Wayland session via some hardware > > media encoder that doesn't support explicit sync. Now we have to > > plumb it all the way through the media stack, gstreamer, etc. Great, > > so let's do that! Oh, but gstreamer won't want to plumb it through > > until they're guaranteed that they can use explicit sync when > > displaying on X11 or Wayland. Are you seeing the problem? > > > > To make matters worse, since most things are doing implicit > > synchronization today, it's really easy to get your explicit > > synchronization wrong and never notice. If you forget to pass a > > sync_file into one place (say you never notice KMS doesn't support > > them), it will probably work anyway thanks to all the implicit sync > > that's going on elsewhere. 
Another motivation is that I'm > > becoming increasingly unhappy with the way that synchronization, > > memory residency, and command submission are inherently intertwined in > > i915 and would like to break things apart. Towards that end, I have > > an actual proposal. > > > > A couple weeks ago, I sent a series of patches to the dri-devel > > mailing list which adds a pair of new ioctls to dma-buf which allow > > userspace to manually import or export a sync_file from a dma-buf. > > The idea is that something like a Wayland compositor can switch to > > 100% explicit sync internally once the ioctl is available. If it gets > > buffers in from a client that doesn't use the explicit sync extension, > > it can pull a sync_file from the dma-buf and use that exactly as it > > would a sync_file passed via the explicit sync extension. When it > > goes to scan out a user buffer and discovers that KMS doesn't accept > > sync_files (or if it tries to use that pesky media encoder no one has > > converted), it can take its sync_file for display and stuff it into > > the dma-buf before handing it to KMS. > > > > Along with the kernel patches, I've also implemented support for this > > in the Vulkan WSI code used by ANV and RADV. With those patches, the > > only requirement on the Vulkan drivers is that you be able to export > > any VkSemaphore as a sync_file and temporarily import a sync_file into > > any VkFence or VkSemaphore. As long as that works, the core Vulkan > > driver only ever sees explicit synchronization via sync_file. The WSI > > code uses these new ioctls to translate the implicit sync of X11 and > > Wayland to the explicit sync the Vulkan driver wants. > > > > I'm hoping (and here's where I want a sanity check) that a simple API > > like this will allow us to finally start moving the Linux ecosystem > > over to explicit synchronization one piece at a time in a way that's > > actually correct. 
(No Wayland explicit sync with compositors hoping > > KMS magically works even though it doesn't have a sync_file API.) > > Once some pieces in the ecosystem start moving, there will be > > motivation to start moving others and maybe we can actually build the > > momentum to get most everything converted. > > > > For reference, you can find the kernel RFC patches and mesa MR here: > > > > https://lists.freedesktop.org/archives/dri-devel/2020-March/258833.html > > > > https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4037 > > > > At this point, I welcome your thoughts, comments, objections, and > > maybe even help/review. :-) > > > > --Jason Ekstrand _______________________________________________ gstreamer-devel mailing list [hidden email] https://lists.freedesktop.org/mailman/listinfo/gstreamer-devel
On Wed, 2020-03-11 at 12:31 -0500, Jason Ekstrand wrote:
> - X11: With present, it has these "explicit" fence objects but > they're always a shmfence which lets the X server and client do a > userspace CPU-side hand-off without going over the socket (and > round-tripping through the kernel). However, the only thing that > fence does is order the OpenGL API calls in the client and server and > the real synchronization is still implicit. I'm pretty sure "the only thing that fence does" is an implementation detail. PresentPixmap blocks until the wait-fence signals, but when and how it signals are properties of the fence itself. You could have drm give the client back a fence fd, pass that to xserver to create a fence object, and name that in the PresentPixmap request, and then drm can do whatever it wants to signal the fence. > From my perspective, as a Vulkan driver developer, I have to deal with > the fact that Vulkan is an explicit sync API but Wayland and X11 > aren't. I'm quite sure we can give you an explicit-sync X11 API. I think you may already have one. - ajax
It seems I may not have set the tone I intended with this e-mail... My
intention was never to stomp on anyone's favorite window system (Adam isn't the only one who's seemed a bit miffed). My intention was to try and solve some very real problems that we have with Vulkan and I had the hope that a solution there could be helpful for others. The problem we have in Vulkan is that we have an inherently explicit sync graphics API and we're trying to strap it onto some inherently implicit sync window systems and kernel interfaces. Our mechanisms for doing so have evolved over the course of the last 4-5 years and it's way better now than it was when we started but it's still pretty bad and very invasive to the driver. My objective is to completely remove the concept of implicit sync from the Vulkan driver eventually. Also (and this is going further down the rabbit hole), I would like to begin cleaning up our i915 UAPI to better separate memory residency handling, command submission, and synchronization. Eventually (and this may sound crazy to some), I'd like to get to the point where i915 doesn't own any of the synchronization primitives except what it needs to handle memory management internally. Linux graphics UAPI is about 10 years behind Windows in terms of design (roughly equivalent to Win7) and I think it's costing us in terms of latency and CPU overhead. Some of that may just be implementation problems in i915; some of it may be core API design. It's a bit unclear. Why am I bringing up kernel APIs? Because one of the biggest problems in evolving things is the fact that our kernel APIs are tied to implicit sync on dma-buf. We can't detangle that until we can remove implicit dma-buf signaling from the command execution APIs. This means that we either need to get rid of ALL implicit synchronization from window-system APIs far enough back in time that we don't run the risk of "breaking userspace" or else we need a plan which lets the kernel driver not support implicit sync but make implicit sync work anyway. 
What I'm proposing with dma-buf sync_file import/export is one such plan. So, while this may not solve any problems for Wayland compositors as I previously thought (KMS/atomic supports sync_file. Yay!), we still have a very real problem in Vulkan. It's great that Wayland has an explicit sync API but until all compositors have supported it for at least 2 years, I can't assume its existence and start deleting my old code paths. Currently, it's only implemented in Weston and the ChromeOS compositor; gnome-shell, kwin, and sway are all still 100% implicit sync AFAIK. We also have to deal with X11. For those who are asking the question in the back of their minds: Yes, I'm trying to solve a userspace problem with kernel code and, no, I don't think that's necessarily the wrong way around. Don't get me wrong; I very much want to solve the problem "properly" but unless we're very sure we can get it solved properly everywhere quickly, a solution which lets us improve our driver kernel APIs independently of misc. Wayland compositors seems advantageous. On Wed, Mar 11, 2020 at 6:02 PM Adam Jackson <[hidden email]> wrote: > > On Wed, 2020-03-11 at 12:31 -0500, Jason Ekstrand wrote: > > > - X11: With present, it has these "explicit" fence objects but > > they're always a shmfence which lets the X server and client do a > > userspace CPU-side hand-off without going over the socket (and > > round-tripping through the kernel). However, the only thing that > > fence does is order the OpenGL API calls in the client and server and > > the real synchronization is still implicit. > > I'm pretty sure "the only thing that fence does" is an implementation > detail. So I've been told, many times. > PresentPixmap blocks until the wait-fence signals, but when and > how it signals are properties of the fence itself. 
You could have drm > give the client back a fence fd, pass that to xserver to create a fence > object, and name that in the PresentPixmap request, and then drm can do > whatever it wants to signal the fence. Poking around at things, X11 may not be quite as bad as I thought here. It's not really set up for sync_file for a couple reasons: 1. It only passes the file descriptor in once at xcb_dri3_fence_from_fd rather than re-creating every frame from a new sync_file 2. It only takes a fence on present and doesn't return one in the PRESENT_COMPLETE event That said, plumbing syncobj in as an extension looks like a real possibility. A syncobj is just a container that holds a pointer to a dma_fence and it has roughly the same CPU signal/reset behavior that's exposed by the SyncFenceFuncsRec struct. There's a few things I'm not sure how to handle: 1. The Sync extension has these trigger funcs which get called when the fence is signalled. I'm not sure how to handle that with syncobj without a thread polling on them somehow. 2. Not all kernel GPU drivers support syncobj; currently it's just i915, amdgpu, and maybe freedreno AFAIK. How do we handle cases such as Intel+Nvidia? 3. I have no idea what kinds of issues we'd run into with plumbing it all through. Hopefully, X is sufficiently abstracted but I really don't know. Please excuse my trepidation but I've got a bit of PTSD from modifiers. That was the last time I tried to solve a problem with someone writing X11 patches and it's been 2-3 years and it's still not shipping in distros. If said syncobj extension suffers the same fate, it isn't a real solution. > > From my perspective, as a Vulkan driver developer, I have to deal with > > the fact that Vulkan is an explicit sync API but Wayland and X11 > > aren't. > > I'm quite sure we can give you an explicit-sync X11 API. I think you > may already have one. It looks like we at least have a bunch of pieces which can probably be used to build one. 
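For the record, the syncobj plumbing described above would look roughly like this. Sketch only: the drmSyncobjCreate/ImportSyncFile/ExportSyncFile helpers are real libdrm functions, but the Present/Sync extension plumbing around them is hypothetical and does not exist today:

```
/* client side (sketch) */
drmSyncobjCreate(drm_fd, 0, &handle);            /* empty fence container */
/* ... submit rendering, receive a sync_file out-fence ... */
drmSyncobjImportSyncFile(drm_fd, handle, sync_file_fd);
/* hypothetical extension request: associate the syncobj with a fence
 * id for this window, analogous to xcb_dri3_fence_from_fd(), but
 * refreshed each frame rather than created once */
xcb_dri3_syncobj_from_fd(conn, drawable, fence_id, syncobj_fd);  /* does not exist */
xcb_present_pixmap(conn, window, pixmap, ..., fence_id, ...);

/* server side (sketch): recover a waitable fd from the container */
drmSyncobjExportSyncFile(drm_fd, handle, &fd);
/* poll(fd) from a helper thread, or pass it as an in-fence to the
 * compositing GPU work */
```

This also makes concrete the problems listed above: the trigger funcs need something to poll that exported fd, and none of it works on kernel drivers without syncobj support.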
--Jason
On Thu, Mar 12, 2020 at 6:36 PM Jason Ekstrand <[hidden email]> wrote:
> From the perspective of a Wayland compositor (I used to play in this > space), they'd love to implement the new explicit sync extension but > can't. Sure, they could wire up the extension, but the moment they go > to flip a client buffer to the screen directly, they discover that KMS > doesn't support any explicit sync APIs. So, yes, they can technically > implement the extension assuming the EGL stack they're running on has > the sync_file extensions but any client buffers which come in using > the explicit sync Wayland extension have to be composited and can't be > scanned out directly. As a 3D driver developer, I absolutely don't > want compositors doing that because my users will complain about > performance issues due to the extra blit. <troll> Maybe this is something for the Marketing Department to solve? Sell the extra processing that can be done during such an extra blit as a feature? As a former user of a wide-gamut monitor that has no sRGB mode, and a gamer, I would definitely accept the extra step (color conversion, not "just a blit"!) between the application and the actual output. In fact, I have set up compicc just for this purpose. Games with poisonous oversaturated colors (because none of the game authors care about wide-gamut monitors) are worse than the same games affected by the very small performance penalty due to the conversion. We just need a Marketing Person to come up with a huge list of other cases where such a compositing step is required for correctness, and declare that direct scanout is something that makes no sense in the present day, except possibly on embedded devices. </troll> Of course the above trolling does not solve the problem related to inability to be sure about the correct API usage. -- Alexander E. Patrakov CV: http://pc.cd/PLz7
There is no synchronization between processes (e.g. 3D app and compositor) within X on AMD hw. It works because of some hacks in Mesa. Marek On Wed, Mar 11, 2020 at 1:31 PM Jason Ekstrand <[hidden email]> wrote: [...]
Could you elaborate? If there's something missing from my mental model of how implicit sync works, I'd like to have it corrected. People continue claiming that AMD is somehow special but I have yet to grasp what makes it so. (Not that anyone has bothered to try all that hard to explain it.) --Jason On March 13, 2020 21:03:21 Marek Olšák <[hidden email]> wrote:
The synchronization works because the Mesa driver waits for idle (drains the GFX pipeline) at the end of command buffers and there is only 1 graphics queue, so everything is ordered. The GFX pipeline runs asynchronously to the command buffer, meaning the command buffer only starts draws and doesn't wait for completion. If the Mesa driver didn't wait at the end of the command buffer, the command buffer would finish and a different process could start execution of its own command buffer while shaders of the previous process are still running. If the Mesa driver submits a command buffer internally (because it's full), it doesn't wait, so the GFX pipeline doesn't notice that a command buffer ended and a new one started. The waiting at the end of command buffers happens only when the flush is external (Swap buffers, glFlush). It's a performance problem, because the GFX queue is blocked until the GFX pipeline is drained at the end of every frame at least. So explicit fences for SwapBuffers would help. Marek On Sun., Mar. 15, 2020, 22:49 Jason Ekstrand, <[hidden email]> wrote:
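The driver behavior Marek describes above can be paraphrased schematically (a sketch of the described mechanism, not actual Mesa/amdgpu code):

```
/* implicit sync via wait-for-idle (current behavior, as described) */
submit(cmdbuf);              /* command buffer starts draws, doesn't wait */
if (flush_is_external)       /* SwapBuffers / glFlush, not internal flushes */
    emit_wait_for_idle();    /* drain the GFX pipeline: blocks the single
                              * GFX queue, so the next process's work
                              * cannot overlap still-running shaders */

/* with an explicit fence for SwapBuffers (what's being asked for) */
fence = submit(cmdbuf);      /* out-fence signals when the work finishes */
swap_buffers(surface, fence);/* the consumer waits on the fence instead of
                              * the queue being drained every frame */
```

This is only a restatement of the mechanism above; as the follow-up reply notes, whether the explicit-fence variant actually avoids the drain is an open question in the thread.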
On 2020-03-16 4:50 a.m., Marek Olšák wrote:
> [...] > > It's a performance problem, because the GFX queue is blocked until the GFX > pipeline is drained at the end of every frame at least. > > So explicit fences for SwapBuffers would help. Not sure what difference it would make, since the same thing needs to be done for explicit fences as well, doesn't it? -- Earthling Michel Dänzer | https://redhat.com Libre software enthusiast | Mesa and X developer
On Wed, Mar 11, 2020 at 04:18:55PM -0400, Nicolas Dufresne wrote:
> (I know I'm going to be spammed by so many mailing list ...) > > Le mercredi 11 mars 2020 à 14:21 -0500, Jason Ekstrand a écrit : > [...]
> > > - gstreamer: ??? > > > - Media APIs such as vaapi etc.: ??? > > GStreamer is consumer for V4L2, VAAPI and other stuff. [...] Notably for camera capture, > where the HW timestamp is captured at about the same time the frame is ready. > But the timestamp is not part of the payload, so you need an entire API to > asynchronously deliver that metadata. [...] Another issue is that V4L2 doesn't offer any guarantee on job ordering. When you queue multiple buffers for camera capture, for instance, you don't know until capture completes in which buffer the frame has been captured. In the normal case buffers are processed in sequence, but if an error occurs during capture, they can be recycled internally and put to the back of the queue. Unless I'm mistaken, this problem also exists with stateful codecs. 
And if you don't know in advance which buffer you will receive from the device, the usefulness of fences becomes very questionable :-) > There is other elements that would implement fencing, notably kmssink, but no > one actually dared porting it to atomic KMS, so clearly there is very little > comunity interest. glimagsink could clearly benifit. Right now if we import a > DMABuf, and that this DMAbuf is used for render, a implicit fence is attached, > which we are unaware. Philippe Zabbel is working on a patch, so V4L2 QBUF would > wait, but waiting in QBUF is not allowed if O_NONBLOCK was set (which GStreamer > uses), so then the operation will just fail where it worked before (breaking > userspace). If it was an explcit fence, we could handle that in GStreamer > cleanly as we do for new APIs. > > > > ## Chicken and egg problems > > > > > > Ok, this is where it starts getting depressing. I made the claim > > > above that Wayland has an explicit synchronization protocol that's of > > > questionable usefulness. I would claim that basically any bit of > > > plumbing we do through window systems is currently of questionable > > > usefulness. Why? > > > > > > From my perspective, as a Vulkan driver developer, I have to deal with > > > the fact that Vulkan is an explicit sync API but Wayland and X11 > > > aren't. Unfortunately, the Wayland extension solves zero problems for > > > me because I can't really use it unless it's implemented in all of the > > > compositors. Until every Wayland compositor I care about my users > > > being able to use (which is basically all of them) supports the > > > extension, I have to continue carry around my pile of hacks to keep > > > implicit sync and Vulkan working nicely together. > > > > > > From the perspective of a Wayland compositor (I used to play in this > > > space), they'd love to implement the new explicit sync extension but > > > can't. 
Sure, they could wire up the extension, but the moment they go > > > to flip a client buffer to the screen directly, they discover that KMS > > > doesn't support any explicit sync APIs. > > > > As per the above correction, Wayland compositors aren't nearly as bad > > off as I initially thought. There may still be weird screen capture > > cases but the normal cases of compositing and displaying via > > KMS/atomic should be in reasonably good shape. > > > > > So, yes, they can technically > > > implement the extension assuming the EGL stack they're running on has > > > the sync_file extensions but any client buffers which come in using > > > the explicit sync Wayland extension have to be composited and can't be > > > scanned out directly. As a 3D driver developer, I absolutely don't > > > want compositors doing that because my users will complain about > > > performance issues due to the extra blit. > > > > > > Ok, so let's say we get KMS wired up with explicit sync. That solves > > > all our problems, right? It does, right up until someone decides that > > > they want to screen capture their Wayland session via some hardware > > > media encoder that doesn't support explicit sync. Now we have to > > > plumb it all the way through the media stack, gstreamer, etc. Great, > > > so let's do that! Oh, but gstreamer won't want to plumb it through > > > until they're guaranteed that they can use explicit sync when > > > displaying on X11 or Wayland. Are you seeing the problem? > > > > > > To make matters worse, since most things are doing implicit > > > synchronization today, it's really easy to get your explicit > > > synchronization wrong and never notice. If you forget to pass a > > > sync_file into one place (say you never notice KMS doesn't support > > > them), it will probably work anyway thanks to all the implicit sync > > > that's going on elsewhere. 
> > > > > > So, clearly, we all need to go write piles of code that we can't > > > actually properly test until everyone else has written their piece and > > > then we use explicit sync if and only if all components support it. > > > Really? We're going to do multiple years of development and then just > > > hope it works when we finally flip the switch? That doesn't sound > > > like a good plan to me. > > > > > > > > > ## A proposal: Implicit and explicit sync together > > > > > > How to solve all these chicken-and-egg problems is something I've been > > > giving quite a bit of thought (and talking with many others about) in > > > the last couple of years. One motivation for this is that we have to > > > deal with a mismatch in Vulkan. Another motivation is that I'm > > > becoming increasingly unhappy with the way that synchronization, > > > memory residency, and command submission are inherently intertwined in > > > i915 and would like to break things apart. Towards that end, I have > > > an actual proposal. > > > > > > A couple weeks ago, I sent a series of patches to the dri-devel > > > mailing list which adds a pair of new ioctls to dma-buf which allow > > > userspace to manually import or export a sync_file from a dma-buf. > > > The idea is that something like a Wayland compositor can switch to > > > 100% explicit sync internally once the ioctl is available. If it gets > > > buffers in from a client that doesn't use the explicit sync extension, > > > it can pull a sync_file from the dma-buf and use that exactly as it > > > would a sync_file passed via the explicit sync extension. When it > > > goes to scan out a user buffer and discovers that KMS doesn't accept > > > sync_files (or if it tries to use that pesky media encoder no one has > > > converted), it can take its sync_file for display and stuff it into > > > the dma-buf before handing it to KMS. 
> > > > > > Along with the kernel patches, I've also implemented support for this > > > in the Vulkan WSI code used by ANV and RADV. With those patches, the > > > only requirement on the Vulkan drivers is that you be able to export > > > any VkSemaphore as a sync_file and temporarily import a sync_file into > > > any VkFence or VkSemaphore. As long as that works, the core Vulkan > > > driver only ever sees explicit synchronization via sync_file. The WSI > > > code uses these new ioctls to translate the implicit sync of X11 and > > > Wayland to the explicit sync the Vulkan driver wants. > > > > > > I'm hoping (and here's where I want a sanity check) that a simple API > > > like this will allow us to finally start moving the Linux ecosystem > > > over to explicit synchronization one piece at a time in a way that's > > > actually correct. (No Wayland explicit sync with compositors hoping > > > KMS magically works even though it doesn't have a sync_file API.) > > > Once some pieces in the ecosystem start moving, there will be > > > motivation to start moving others and maybe we can actually build the > > > momentum to get most everything converted. > > > > > > For reference, you can find the kernel RFC patches and mesa MR here: > > > > > > https://lists.freedesktop.org/archives/dri-devel/2020-March/258833.html > > > > > > https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4037 > > > > > > At this point, I welcome your thoughts, comments, objections, and > > > maybe even help/review. :-) > > > > > > --Jason Ekstrand > -- Regards, Laurent Pinchart _______________________________________________ gstreamer-devel mailing list [hidden email] https://lists.freedesktop.org/mailman/listinfo/gstreamer-devel |
Hi Jason, I've been wrestling with the sync problems in Wayland some time ago, but only with regards to 3D drivers. The guarantee given by the GL/GLES spec is limited to a single graphics context. If the same buffer is accessed by two contexts the outcome is unspecified. The cross-context and cross-process synchronisation is not guaranteed. It happens to work on Mesa, because the read/write locking is implemented in the kernel space, but it didn't work on the Broadcom driver, which has read-write interlocks in user space. A Vulkan client makes it even worse because of conflicting requirements: Vulkan's vkQueuePresentKHR() passes in a number of semaphores but disallows waiting. Wayland WSI requires wl_surface_commit() to be called from vkQueuePresentKHR(), which does require a wait, unless a synchronisation primitive representing the Vulkan semaphores is passed between the Vulkan client and the compositor. The most troublesome part was the Wayland buffer release mechanism, as it only involves CPU signalling over Wayland IPC, without any 3D driver involvement. The choices were: an explicit synchronisation extension, or a buffer copy in the compositor (i.e. the compositor textures from the copy, so the client can re-write the original), or some implicit synchronisation in kernel space (but that wasn't an option in the Broadcom driver). With regards to V4L2, I believe it could easily work the same way as 3D drivers, i.e. pass a buffer+fence pair to the next stage. Encode always succeeds, but for capture or decode, the main problem is the uncertain outcome, I believe? If we're fine with rendering or displaying an occasional broken frame, then a buffer+fence pair would work too. The broken frame will go into the pipeline, but the application can drain the pipeline and start over once the capture works again. 
To answer some points raised by Laurent (although I'm unfamiliar with the camera drivers): > you don't know until capture complete in which buffer the frame has been captured Surely you do, you only don't know in advance if the capture will be successful. > but if an error occurs during capture, they can be recycled internally and put to the back of the queue. That would have to change in order to use explicit synchronisation. Every started capture becomes immediately available as a buffer+fence pair. The fence is signalled once the capture is finished (successfully or otherwise). The buffer must not be reused until it's released, possibly with another fence - in that case the buffer must not be reused until the release fence is signalled. Cheers, Tomek On Mon, 16 Mar 2020 at 10:20, Laurent Pinchart <[hidden email]> wrote: On Wed, Mar 11, 2020 at 04:18:55PM -0400, Nicolas Dufresne wrote:
Hi Tomek,
On Mon, Mar 16, 2020 at 12:55:27PM +0000, Tomek Bury wrote:
> Hi Jason,

[snip - Tomek's message, quoted in full immediately above]
> > To answer some points raised by Laurent (although I'm unfamiliar with the > camera drivers): > > > you don't know until capture complete in which buffer the frame has > > been captured > > Surely you do, you only don't know in advance if the capture will be successful You do in kernelspace, but not in userspace at the moment, due to buffer recycling. > > but if an error occurs during capture, they can be recycled internally and > > put to the back of the queue. > > That would have to change in order to use explicit synchronisation. Every > started capture becomes immediately available as a buffer+fence pair. Fence is > signalled once the capture is finished (successfully or otherwise). The buffer > must not be reused until it's released, possibly with another fence - in that > case the buffer must not be reused until the release fence is signalled. We could certainly change this at least in some cases, but it would break existing userspace that doesn't expect incorrect frames. I'm however not sure we could change this behaviour in every case; there may be hardware that can't provide a guarantee on the order in which buffers will be used. I'm aware this wouldn't be compatible with explicit synchronization, and that's my point: camera hardware may not always support explicit synchronization. As long as we can fall back to not using fences then we should be fine. -- Regards, Laurent Pinchart
> As long as we can fall back to not using fences then we should be fine.
Buffers written by the camera are trivial because you control what happens - just don't attach a fence, so that the capture can be used immediately. For recycled buffers there's an extra bit of work to do, because it won't be up to the camera driver to decide whether the buffer comes back with or without a fence.
In reply to this post by Tomek Bury
Hi Tomek,
On Mon, 16 Mar 2020 at 12:55, Tomek Bury <[hidden email]> wrote: > I've been wrestling with the sync problems in Wayland some time ago, but only with regards to 3D drivers. > > The guarantee given by the GL/GLES spec is limited to a single graphics context. If the same buffer is accessed by two contexts the outcome is unspecified. The cross-context and cross-process synchronisation is not guaranteed. It happens to work on Mesa, because the read/write locking is implemented in the kernel space, but it didn't work on the Broadcom driver, which has read-write interlocks in user space. GL and GLES are not relevant. What is relevant is EGL, which defines interfaces to make things work on the native platform. EGL doesn't define any kind of synchronisation model for the Wayland, X11, or GBM/KMS platforms - but it's one of the things which has to work. It doesn't say that the implementation must make sure that the requested format is displayable, but you sort of take it for granted that if you ask EGL to display something it will do so. Synchronisation is one of those mechanisms which is left to the platform to implement under the hood. In the absence of platform support for explicit synchronisation, the synchronisation must be implicit. > A Vulkan client makes it even worse because of conflicting requirements: Vulkan's vkQueuePresentKHR() passes in a number of semaphores but disallows waiting. Wayland WSI requires wl_surface_commit() to be called from vkQueuePresentKHR(), which does require a wait, unless a synchronisation primitive representing the Vulkan semaphores is passed between the Vulkan client and the compositor. If you are using EGL_WL_bind_wayland_display, then one of the things it is explicitly allowed/expected to do is to create a Wayland protocol interface between client and compositor, which can be used to pass buffer handles and metadata in a platform-specific way. Adding synchronisation is also possible. 
> The most troublesome part was the Wayland buffer release mechanism, as it only involves CPU signalling over Wayland IPC, without any 3D driver involvement. The choices were: an explicit synchronisation extension, or a buffer copy in the compositor (i.e. the compositor textures from the copy, so the client can re-write the original), or some implicit synchronisation in kernel space (but that wasn't an option in the Broadcom driver). You can add your own explicit synchronisation extension. In every cross-process and cross-subsystem usecase, synchronisation is obviously required. The options for this are to implement kernel support for implicit synchronisation (as everyone else has done), or implement generic support for explicit synchronisation (as we have been working on with implementations inside Weston and Exosphere at least), or implement private support for explicit synchronisation, or do nothing and then be surprised at the lack of synchronisation. Cheers, Daniel
In reply to this post by Laurent Pinchart
On Mon, Mar 16, 2020 at 5:20 AM Laurent Pinchart
<[hidden email]> wrote:
>
> On Wed, Mar 11, 2020 at 04:18:55PM -0400, Nicolas Dufresne wrote:
> > (I know I'm going to be spammed by so many mailing list ...)
> >
> > Le mercredi 11 mars 2020 à 14:21 -0500, Jason Ekstrand a écrit :
> > > On Wed, Mar 11, 2020 at 12:31 PM Jason Ekstrand <[hidden email]> wrote:

[snip - Jason's original proposal and most of Nicolas's reply, quoted in full earlier in the thread]

> > [...] Notably for camera capture, where the HW timestamp is captured at about the same time the frame is ready. 
But the > > timestamp is not part of the payload, so you need an entire API to asynchronously > > deliver that metadata. It's the biggest pain point I've found; such an API would > > be quite invasive or, if made really generic, might just never be adopted widely > > enough. > > Another issue is that V4L2 doesn't offer any guarantee on job ordering. > When you queue multiple buffers for camera capture for instance, you > don't know until capture complete in which buffer the frame has been > captured. Is this a kernel UAPI issue? Surely the kernel driver knows at the start of frame capture which buffer it's getting written into. I would think that the kernel APIs could be adjusted (if we find good reason to do so!) such that they return earlier and return a (buffer, fence) pair. Am I missing something fundamental about video here? I must admit that V4L is a bit of an odd case since the kernel driver is the producer and not the consumer. > In the normal case buffers are processed in sequence, but if > an error occurs during capture, they can be recycled internally and put > to the back of the queue. Are those errors something that can happen at any time in the middle of a frame capture? If so, that does make things stickier. > Unless I'm mistaken, this problem also exists > with stateful codecs. And if you don't know in advance which buffer you > will receive from the device, the usefulness of fences becomes very > questionable :-) Yeah, if you really are in a situation where there's no way to know until the full frame capture has been completed which buffer is next, then fences are useless. You aren't in an implicit synchronization setting either; you're in a "full flush" setting. It's arguably worse for performance but perhaps unavoidable? Trying to understand. :-) --Jason > > There are other elements that would implement fencing, notably kmssink, but no > > one actually dared porting it to atomic KMS, so clearly there is very little > > community interest. 
glimagsink could clearly benifit. Right now if we import a > > DMABuf, and that this DMAbuf is used for render, a implicit fence is attached, > > which we are unaware. Philippe Zabbel is working on a patch, so V4L2 QBUF would > > wait, but waiting in QBUF is not allowed if O_NONBLOCK was set (which GStreamer > > uses), so then the operation will just fail where it worked before (breaking > > userspace). If it was an explcit fence, we could handle that in GStreamer > > cleanly as we do for new APIs. > > > > > > ## Chicken and egg problems > > > > > > > > Ok, this is where it starts getting depressing. I made the claim > > > > above that Wayland has an explicit synchronization protocol that's of > > > > questionable usefulness. I would claim that basically any bit of > > > > plumbing we do through window systems is currently of questionable > > > > usefulness. Why? > > > > > > > > From my perspective, as a Vulkan driver developer, I have to deal with > > > > the fact that Vulkan is an explicit sync API but Wayland and X11 > > > > aren't. Unfortunately, the Wayland extension solves zero problems for > > > > me because I can't really use it unless it's implemented in all of the > > > > compositors. Until every Wayland compositor I care about my users > > > > being able to use (which is basically all of them) supports the > > > > extension, I have to continue carry around my pile of hacks to keep > > > > implicit sync and Vulkan working nicely together. > > > > > > > > From the perspective of a Wayland compositor (I used to play in this > > > > space), they'd love to implement the new explicit sync extension but > > > > can't. Sure, they could wire up the extension, but the moment they go > > > > to flip a client buffer to the screen directly, they discover that KMS > > > > doesn't support any explicit sync APIs. > > > > > > As per the above correction, Wayland compositors aren't nearly as bad > > > off as I initially thought. 
> > > There may still be weird screen capture cases but the normal cases of compositing and displaying via KMS/atomic should be in reasonably good shape.
> > >
> > > > So, yes, they can technically implement the extension assuming the EGL stack they're running on has the sync_file extensions, but any client buffers which come in using the explicit sync Wayland extension have to be composited and can't be scanned out directly. As a 3D driver developer, I absolutely don't want compositors doing that because my users will complain about performance issues due to the extra blit.
> > > >
> > > > Ok, so let's say we get KMS wired up with implicit sync. That solves all our problems, right? It does, right up until someone decides that they want to screen capture their Wayland session via some hardware media encoder that doesn't support explicit sync. Now we have to plumb it all the way through the media stack, gstreamer, etc. Great, so let's do that! Oh, but gstreamer won't want to plumb it through until they're guaranteed that they can use explicit sync when displaying on X11 or Wayland. Are you seeing the problem?
> > > >
> > > > To make matters worse, since most things are doing implicit synchronization today, it's really easy to get your explicit synchronization wrong and never notice. If you forget to pass a sync_file into one place (say you never notice KMS doesn't support them), it will probably work anyway thanks to all the implicit sync that's going on elsewhere.
> > > >
> > > > So, clearly, we all need to go write piles of code that we can't actually properly test until everyone else has written their piece, and then we use explicit sync if and only if all components support it. Really? We're going to do multiple years of development and then just hope it works when we finally flip the switch? That doesn't sound like a good plan to me.
> > > >
> > > > ## A proposal: Implicit and explicit sync together
> > > >
> > > > How to solve all these chicken-and-egg problems is something I've been giving quite a bit of thought (and talking with many others about) over the last couple of years. One motivation for this is that we have to deal with a mismatch in Vulkan. Another motivation is that I'm becoming increasingly unhappy with the way that synchronization, memory residency, and command submission are inherently intertwined in i915 and would like to break things apart. Towards that end, I have an actual proposal.
> > > >
> > > > A couple weeks ago, I sent a series of patches to the dri-devel mailing list which adds a pair of new ioctls to dma-buf allowing userspace to manually import or export a sync_file from a dma-buf. The idea is that something like a Wayland compositor can switch to 100% explicit sync internally once the ioctl is available. If it gets buffers in from a client that doesn't use the explicit sync extension, it can pull a sync_file from the dma-buf and use that exactly as it would a sync_file passed via the explicit sync extension. When it goes to scan out a user buffer and discovers that KMS doesn't accept sync_files (or if it tries to use that pesky media encoder no one has converted), it can take its sync_file for display and stuff it into the dma-buf before handing it to KMS.
> > > >
> > > > Along with the kernel patches, I've also implemented support for this in the Vulkan WSI code used by ANV and RADV. With those patches, the only requirement on the Vulkan drivers is that you be able to export any VkSemaphore as a sync_file and temporarily import a sync_file into any VkFence or VkSemaphore. As long as that works, the core Vulkan driver only ever sees explicit synchronization via sync_file. The WSI code uses these new ioctls to translate the implicit sync of X11 and Wayland into the explicit sync the Vulkan driver wants.
> > > >
> > > > I'm hoping (and here's where I want a sanity check) that a simple API like this will allow us to finally start moving the Linux ecosystem over to explicit synchronization one piece at a time in a way that's actually correct. (No Wayland explicit sync with compositors hoping KMS magically works even though it doesn't have a sync_file API.) Once some pieces in the ecosystem start moving, there will be motivation to start moving others, and maybe we can actually build the momentum to get most everything converted.
> > > >
> > > > For reference, you can find the kernel RFC patches and mesa MR here:
> > > >
> > > > https://lists.freedesktop.org/archives/dri-devel/2020-March/258833.html
> > > > https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4037
> > > >
> > > > At this point, I welcome your thoughts, comments, objections, and maybe even help/review. :-)
> > > >
> > > > --Jason Ekstrand
>
> --
> Regards,
>
> Laurent Pinchart

gstreamer-devel mailing list
[hidden email]
https://lists.freedesktop.org/mailman/listinfo/gstreamer-devel
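[Editor's note] To make the proposal above concrete, here is a minimal userspace sketch of the pair of dma-buf ioctls being discussed. The struct layouts, ioctl numbers, and flag values below are assumptions modelled on the uapi from the RFC series (the final ABI may differ), and the two wrapper functions are hypothetical helpers, not part of any existing library:

```c
/* Sketch of userspace usage of the proposed dma-buf sync_file ioctls.
 * The definitions below are ASSUMPTIONS based on the RFC uapi and may
 * not match what eventually lands in linux/dma-buf.h. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/types.h>

struct dma_buf_export_sync_file {
	__u32 flags;   /* DMA_BUF_SYNC_READ and/or DMA_BUF_SYNC_WRITE */
	__s32 fd;      /* out: the exported sync_file fd */
};

struct dma_buf_import_sync_file {
	__u32 flags;
	__s32 fd;      /* in: sync_file fd to fold into the dma-buf's fences */
};

#define DMA_BUF_SYNC_READ   (1 << 0)
#define DMA_BUF_SYNC_WRITE  (2 << 0)
#define DMA_BUF_BASE 'b'
#define DMA_BUF_IOCTL_EXPORT_SYNC_FILE _IOWR(DMA_BUF_BASE, 2, struct dma_buf_export_sync_file)
#define DMA_BUF_IOCTL_IMPORT_SYNC_FILE _IOW(DMA_BUF_BASE, 3, struct dma_buf_import_sync_file)

/* Pull a sync_file out of a dma-buf: e.g. a compositor extracting the
 * implicit fences left by a client that doesn't speak explicit sync.
 * Returns a sync_file fd, or -1 on error (e.g. fd is not a dma-buf). */
int dmabuf_export_sync_file(int dmabuf_fd, __u32 flags)
{
	struct dma_buf_export_sync_file arg = { .flags = flags, .fd = -1 };

	if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &arg) < 0)
		return -1;
	return arg.fd;
}

/* Stuff a sync_file back into a dma-buf: e.g. before handing the buffer
 * to legacy KMS or a media encoder that only does implicit sync.
 * Returns 0 on success, -1 on error. */
int dmabuf_import_sync_file(int dmabuf_fd, __u32 flags, int sync_file_fd)
{
	struct dma_buf_import_sync_file arg = { .flags = flags, .fd = sync_file_fd };

	return ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &arg);
}
```

A compositor following the proposal would call dmabuf_export_sync_file() on each incoming buffer and treat the result exactly like a fence received via zwp_linux_explicit_synchronization_v1; on the way out, dmabuf_import_sync_file() restores implicit sync for consumers that don't accept sync_files.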
In reply to this post by Daniel Stone
> GL and GLES are not relevant. What is relevant is EGL, which defines interfaces to make things work on the native platform.

Yes and no. This is what the EGL spec says about sharing a texture between contexts:

"OpenGL and OpenGL ES makes no attempt to synchronize access to texture objects. If a texture object is bound to more than one context, then it is up to the programmer to ensure that the contents of the object are not being changed via one context while another context is using the texture object for rendering. The results of changing a texture object while another context is using it are undefined."

There are similar statements with regard to the lack of synchronisation guarantees for EGL images or between GL and native rendering, etc. But the main thing here is that EGL and Vulkan differ significantly. eglSwapBuffers() is expected to post an unspecified "back buffer" to the display system using some internal driver magic. The EGL driver is then expected to obtain another back buffer at some unspecified point in the future. Vulkan, on the other hand, is very specific and explicit. vkQueuePresentKHR() is expected to post a specific VkImage with an explicit set of semaphores. Another image is obtained through vkAcquireNextImageKHR(), and it's the application's decision whether it wants a fence, a semaphore, both or none with the acquired buffer. Implicit synchronisation doesn't mix well with Vulkan drivers and requires a lot of extra plumbing in the WSI code.

> If you are using EGL_WL_bind_wayland_display, then one of the things it is explicitly allowed/expected to do is to create a Wayland protocol interface between client and compositor, which can be used to pass buffer handles and metadata in a platform-specific way. Adding synchronisation is also possible.

Only one-way synchronisation is possible with this mechanism. There's a standard protocol for recycling buffers - wl_buffer_release() - so buffer hand-over from the compositor to the client remains unsynchronised - see below.

> > The most troublesome part was the Wayland buffer release mechanism, as it only involves CPU signalling over Wayland IPC, without any 3D driver involvement. The choices were: an explicit synchronisation extension, or a buffer copy in the compositor (i.e. the compositor textures from the copy, so the client can re-write the original), or some implicit synchronisation in kernel space (but that wasn't an option in the Broadcom driver).
>
> You can add your own explicit synchronisation extension.

I could, but that requires implementing it in the driver and in a number of compositors, so a standard extension, zwp_linux_explicit_synchronization_v1, is a much better choice here than a custom one.

> In every cross-process and cross-subsystem usecase, synchronisation is obviously required. The two options for this are to implement kernel support for implicit synchronisation (as everyone else has done),

That would require major changes in driver architecture, or a second mechanism doing the same thing but in kernel space - both are non-starters.

> or implement generic support for explicit synchronisation (as we have been working on with implementations inside Weston and Exosphere at least),

The zwp_linux_explicit_synchronization_v1 extension is a good step forward. I'm using it as the main synchronisation mechanism in the EGL and Vulkan drivers whenever available. I remember that Gustavo Padovan was working on explicit sync support in the display system some time ago. I hope it got merged into the kernel by now, but I don't know to what extent it's actually being used.

> or implement private support for explicit synchronisation,

If everything else fails, that would be the last-resort scenario, but it's far from ideal and very costly in terms of implementation and maintenance, as it would require maintaining custom patches for various 3rd-party components or littering them with multiple custom explicit synchronisation schemes.

> or do nothing and then be surprised at the lack of synchronisation.

Thank you, but no, thank you :)

Cheers,
Tomek
> vkAcquireNextImageKHR() [...] it's the application's decision whether it wants a fence, a semaphore, both or none
Correction: "or none" is not allowed.
In reply to this post by Tomek Bury
On Mon, Mar 16, 2020 at 10:33 AM Tomek Bury <[hidden email]> wrote:
> > GL and GLES are not relevant. What is relevant is EGL, which defines interfaces to make things work on the native platform.
>
> Yes and no. This is what the EGL spec says about sharing a texture between contexts:
>
> "OpenGL and OpenGL ES makes no attempt to synchronize access to texture objects. If a texture object is bound to more than one context, then it is up to the programmer to ensure that the contents of the object are not being changed via one context while another context is using the texture object for rendering. The results of changing a texture object while another context is using it are undefined."
>
> There are similar statements with regard to the lack of synchronisation guarantees for EGL images or between GL and native rendering, etc. But the main thing here is that EGL and Vulkan differ significantly. eglSwapBuffers() is expected to post an unspecified "back buffer" to the display system using some internal driver magic. The EGL driver is then expected to obtain another back buffer at some unspecified point in the future. Vulkan, on the other hand, is very specific and explicit. vkQueuePresentKHR() is expected to post a specific VkImage with an explicit set of semaphores. Another image is obtained through vkAcquireNextImageKHR(), and it's the application's decision whether it wants a fence, a semaphore, both or none with the acquired buffer. Implicit synchronisation doesn't mix well with Vulkan drivers and requires a lot of extra plumbing in the WSI code.

Yes, and that (the Vulkan issues in particular) is what I'm trying to fix. :-) (Among other things...) Assuming the kernel patch I linked to, your usermode driver could stuff fences in the dma-buf without having that be part of your kernel driver. This assumes, of course, that your kernel driver supports sync_file.

> > If you are using EGL_WL_bind_wayland_display, then one of the things it is explicitly allowed/expected to do is to create a Wayland protocol interface between client and compositor, which can be used to pass buffer handles and metadata in a platform-specific way. Adding synchronisation is also possible.
>
> Only one-way synchronisation is possible with this mechanism. There's a standard protocol for recycling buffers - wl_buffer_release() - so buffer hand-over from the compositor to the client remains unsynchronised - see below.
>
> > > The most troublesome part was the Wayland buffer release mechanism, as it only involves CPU signalling over Wayland IPC, without any 3D driver involvement. The choices were: an explicit synchronisation extension, or a buffer copy in the compositor (i.e. the compositor textures from the copy, so the client can re-write the original), or some implicit synchronisation in kernel space (but that wasn't an option in the Broadcom driver).
> >
> > You can add your own explicit synchronisation extension.
>
> I could, but that requires implementing it in the driver and in a number of compositors, so a standard extension, zwp_linux_explicit_synchronization_v1, is a much better choice here than a custom one.

I think you may be missing what Daniel is saying. Wayland allows you to do basically anything you want within your client and server-side EGL implementations. That could include the server-side EGL sending an event with a fence every single time a flush operation happens in the server-side GL/GLES implementation. (That could be glFlush, glFinish, eglSwapBuffers, or other things.) Since Wayland protocol events are ordered, the client-side EGL implementation would get the most recent flush event before it got the wl_buffer::release. I fully agree that it's rather cumbersome, though.

> > In every cross-process and cross-subsystem usecase, synchronisation is obviously required. The two options for this are to implement kernel support for implicit synchronisation (as everyone else has done),
>
> That would require major changes in driver architecture, or a second mechanism doing the same thing but in kernel space - both are non-starters.
>
> > or implement generic support for explicit synchronisation (as we have been working on with implementations inside Weston and Exosphere at least),
>
> The zwp_linux_explicit_synchronization_v1 extension is a good step forward. I'm using it as the main synchronisation mechanism in the EGL and Vulkan drivers whenever available. I remember that Gustavo Padovan was working on explicit sync support in the display system some time ago. I hope it got merged into the kernel by now, but I don't know to what extent it's actually being used.

It is supported by KMS/atomic. Legacy KMS, however, does not support it.

> > or implement private support for explicit synchronisation,
>
> If everything else fails, that would be the last-resort scenario, but it's far from ideal and very costly in terms of implementation and maintenance, as it would require maintaining custom patches for various 3rd-party components or littering them with multiple custom explicit synchronisation schemes.

If you want to see explicit synchronization everywhere, I would very much like to see more developers driving its adoption. I implemented support in the Intel Vulkan driver last week:

https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4169

Hopefully, that will provide some motivation for other compositors (kwin, gnome-shell, etc.) because they now have a real user of it in an upstream driver for a major desktop platform and not just a few weston examples. However, someone is going to have to drive the actual development in those compositors. I'd be very happy if more people got involved. :-)

--Jason
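[Editor's note] The ordering argument above (a per-flush fence event queued before wl_buffer::release must be seen first by the client) can be modelled with a few lines of code. The sketch below is a toy simulation, not real Wayland code: a single in-order queue stands in for the Wayland connection, and the event/struct names are invented for illustration:

```c
/* Toy model of the ordering argument: Wayland delivers events in order,
 * so if the server-side EGL queues a "fence for my last flush" event
 * before the compositor queues the buffer release, the client-side EGL
 * always learns which fence to wait on before it learns the buffer is
 * free. Simulation only; no real Wayland APIs are used. */
#include <assert.h>

enum event_type { EVENT_FLUSH_FENCE, EVENT_BUFFER_RELEASE };

struct event { enum event_type type; int fence_id; };

#define QUEUE_CAP 16
struct event_queue { struct event ev[QUEUE_CAP]; int head, tail; };

static void queue_push(struct event_queue *q, struct event e)
{
	q->ev[q->tail++] = e;      /* FIFO: preserves server-side order */
}

static int queue_pop(struct event_queue *q, struct event *e)
{
	if (q->head == q->tail)
		return 0;
	*e = q->ev[q->head++];
	return 1;
}

/* Client-side state: the most recent fence seen, and the fence the
 * released buffer must wait on before reuse (-1 means none yet). */
struct client { int last_fence_id; int buffer_reusable_after_fence; };

static void client_dispatch(struct client *c, struct event_queue *q)
{
	struct event e;

	while (queue_pop(q, &e)) {
		if (e.type == EVENT_FLUSH_FENCE)
			c->last_fence_id = e.fence_id;
		else /* EVENT_BUFFER_RELEASE: pair release with newest fence */
			c->buffer_reusable_after_fence = c->last_fence_id;
	}
}
```

Because the queue is strictly FIFO, the release can never overtake the fence event that precedes it, which is exactly why per-flush fence events are sufficient (if cumbersome) for compositor-to-client synchronisation.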
In reply to this post by Tomek Bury
Hi,
On Mon, 16 Mar 2020 at 15:33, Tomek Bury <[hidden email]> wrote:
> > GL and GLES are not relevant. What is relevant is EGL, which defines interfaces to make things work on the native platform.
> Yes and no. This is what the EGL spec says about sharing a texture between contexts:

Contexts are different though ...

> There are similar statements with regard to the lack of synchronisation guarantees for EGL images or between GL and native rendering, etc.

This also isn't about native rendering.

> But the main thing here is that EGL and Vulkan differ significantly.

Sure, I totally agree.

> eglSwapBuffers() is expected to post an unspecified "back buffer" to the display system using some internal driver magic. The EGL driver is then expected to obtain another back buffer at some unspecified point in the future.

Yes, this is rather the point: EGL doesn't specify platform-related 'black magic' to make things just work, because that's part of the platform implementation details. And, as things stand, on Linux one of those things is implicit synchronisation, unless the desired end state of your driver is no synchronisation. This thread is a discussion about changing that.

> > If you are using EGL_WL_bind_wayland_display, then one of the things it is explicitly allowed/expected to do is to create a Wayland protocol interface between client and compositor, which can be used to pass buffer handles and metadata in a platform-specific way. Adding synchronisation is also possible.
> Only one-way synchronisation is possible with this mechanism. There's a standard protocol for recycling buffers - wl_buffer_release() - so buffer hand-over from the compositor to the client remains unsynchronised - see below.

That's not true; you can post back a sync token every time the client buffer is used by the compositor.

> > > The most troublesome part was the Wayland buffer release mechanism, as it only involves CPU signalling over Wayland IPC, without any 3D driver involvement. The choices were: an explicit synchronisation extension, or a buffer copy in the compositor (i.e. the compositor textures from the copy, so the client can re-write the original), or some implicit synchronisation in kernel space (but that wasn't an option in the Broadcom driver).
> > You can add your own explicit synchronisation extension.
> I could, but that requires implementing it in the driver and in a number of compositors, so a standard extension zwp_linux_explicit_synchronization_v1 is a much better choice here than a custom one.

EGL_WL_bind_wayland_display is explicitly designed to allow each driver to implement its own private extensions without modifying compositors. For instance, Mesa adds the `wl_drm` extension, which is used for bidirectional communication between the EGL implementations in the client and compositor address spaces, without modifying either.

> > In every cross-process and cross-subsystem usecase, synchronisation is obviously required. The two options for this are to implement kernel support for implicit synchronisation (as everyone else has done),
> That would require major changes in driver architecture, or a second mechanism doing the same thing but in kernel space - both are non-starters.

OK. As it stands, everyone else has the kernel mechanism (e.g. via dmabuf resv), so in this case, if you are reinventing the underlying platform in a proprietary stack, you get to solve the same problems yourselves.

Cheers,
Daniel