Hi folks,
I noticed performance issues due to the rewrite of pulsesink since the 0.10.15 release. The degradation is in the 30% range on my Atom board when playing MP3/AAC. There have been a couple of modifications in git related to buffer attributes and latency settings, but overall the overhead remains, and the pulsesink code could be further optimized for low-power playback apps that don't care about latency.

I finally took the time to look at the code and check what was going on. It seems that the overhead is mainly due to the granularity of transfers between pulsesink and PulseAudio. What happens is that the sink waits for space to become available in the PulseAudio buffer. When PA requests data in a callback, the mainloop unblocks and the sink writes its PCM to PulseAudio. The problem is that the sink will not try to fill the whole buffer before handing the data off to PulseAudio. For example, say PulseAudio requests 100k (as defined by minreq) and you are doing MP3 decode: you are going to send one frame (4608 bytes) at a time to PulseAudio until the 100k have been filled. That's a lot of overhead. It would be a lot more efficient power-wise to decode and store as many frames as possible into the PA buffer before calling pa_stream_write().

I have snippets of code as a proof of concept. I don't mind releasing the code, but I must admit this is a hack and does not cover all the cases pulsesink addresses. An additional optimization could consist of passing the PulseAudio buffer upstream to avoid memory copies. The new PA release provides support for this with pa_stream_begin_write(). In short, I would badly need a review from more experienced developers... If anyone is interested let me know.

Cheers,
- Pierre
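[Editor's note: for illustration, here is a minimal sketch of the aggregation idea described above, in the shape of a PA write-request callback. decode_one_frame() and MAX_FRAME_SIZE are made-up placeholders, and this ignores clocking, flushing and all the other cases pulsesink has to handle; pa_stream_begin_write() is available since PulseAudio 0.9.16.]

  #include <stdint.h>
  #include <pulse/pulseaudio.h>

  #define MAX_FRAME_SIZE 4608   /* one decoded MP3 frame, as in the example above */

  /* Hypothetical decoder hook, not a real API: writes at most 'max' bytes
   * of decoded PCM into 'dst' and returns how many bytes it produced. */
  extern size_t decode_one_frame(uint8_t *dst, size_t max, void *userdata);

  /* Fill as much of the requested space as possible with decoded frames
   * before handing the block to PulseAudio in a single write. */
  static void write_cb(pa_stream *s, size_t nbytes, void *userdata)
  {
      void *buf;
      size_t avail = nbytes;   /* PA tells us how much it wants */
      size_t filled = 0, n;

      /* Ask PA for a buffer we can fill directly (zero-copy path). */
      if (pa_stream_begin_write(s, &buf, &avail) < 0)
          return;

      /* Decode as many frames as fit, instead of one frame per write. */
      while (avail - filled >= MAX_FRAME_SIZE &&
             (n = decode_one_frame((uint8_t *) buf + filled,
                                   avail - filled, userdata)) > 0)
          filled += n;

      if (filled > 0)
          pa_stream_write(s, buf, filled, NULL, 0, PA_SEEK_RELATIVE);
      else
          pa_stream_cancel_write(s);   /* nothing decoded, release the buffer */
  }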
pl bossart wrote:
> Hi folks,
> I noticed performance issues due to the rewrite of pulsesink since the
> 0.10.15 release. The degradation is in the 30% range on my Atom board
> when playing MP3/AAC. There have been a couple of modifications in git
> related to buffer attributes and latency settings, but overall the
> overhead remains, and the pulsesink code could be further optimized
> for low-power playback apps that don't care about latency.

I noticed the same on the Nokia N900.

> I finally took the time to look at the code and check what was going
> on. It seems that the overhead is mainly due to the granularity of
> transfers between pulsesink and PulseAudio. What happens is that the
> sink waits for space to become available in the PulseAudio buffer. When PA
> requests data in a callback, the mainloop unblocks and the sink writes
> its PCM to PulseAudio. The problem is that the sink will not try to
> fill the whole buffer before handing the data off to PulseAudio. For
> example, say PulseAudio requests 100k (as defined by minreq) and you
> are doing MP3 decode: you are going to send one frame (4608 bytes) at
> a time to PulseAudio until the 100k have been filled. That's a lot of
> overhead. It would be a lot more efficient power-wise to decode and
> store as many frames as possible into the PA buffer before calling
> pa_stream_write().

Wim just committed my patch that changes pulsesink back to setting minreq to the value of the latency-time property, which lets applications tune the gst<->pa overhead again.

During the investigation of that regression, I found that there are some further things to optimize in pulsesink. I will be filing more bugs and sending more patches as I come up with better solutions.

> I have snippets of code as a proof of concept. I don't mind releasing
> the code, but I must admit this is a hack and does not cover all the
> cases pulsesink addresses. An additional optimization could consist of
> passing the PulseAudio buffer upstream to avoid memory copies. The new
> PA release provides support for this with pa_stream_begin_write(). In
> short, I would badly need a review from more experienced developers...
> If anyone is interested let me know.
>
> Cheers,
> - Pierre

Using that API is a step in the right direction. However, there is still a lot to do. GStreamer desperately needs a zero-copy mechanism for audio, so that the audio decoders' output buffer sizing doesn't incur arbitrary overhead.

For the time being, I think you can get almost the same performance/battery life gain by increasing the output buffer size of your audio decoders. Felipe Contreras has been trying this with the vorbis decoder, with good results.

--
Regards,
René Stadler
Howdy René,
> Wim just committed my patch that changes pulsesink back to setting minreq to
> the value of the latency-time property, which lets applications tune the
> gst<->pa overhead again.

Hmm, my experiments show that the CPU activity increases when minreq is > 64k. I sort of remember Lennart mentioning that this was the size of the block allocated in PA, and beyond this you would use malloc(). Besides, it seems to me that the total latency is really defined by tlength; if you increase minreq, the size of the server buffer will be adjusted. See Lennart's page at http://pulseaudio.org/wiki/LatencyControl: latency is defined with tlength, and minreq has no direct impact on latency.

And as I mentioned, the patch doesn't change the overhead, since we keep writing the same size no matter what minreq was set to.

> During the investigation of that regression, I found that there are some further
> things to optimize in pulsesink. I will be filing more bugs and sending more
> patches as I come up with better solutions.

Will send you my code.

> For the time being, I think you can get almost the same performance/battery
> life gain by increasing the output buffer size of your audio decoders. Felipe
> Contreras has been trying this with the vorbis decoder, with good results.

That's not necessarily an option. There are 3rd party decoders out there whose code is not necessarily public. And fixing the decoders is somewhat odd when the real problem is the sink...

Cheers,
-Pierre
pl bossart wrote:
> Howdy René,
>
>> Wim just committed my patch that changes pulsesink back to setting minreq to
>> the value of the latency-time property, which lets applications tune the
>> gst<->pa overhead again.
>
> Hmm, my experiments show that the CPU activity increases when minreq
> is > 64k. I sort of remember Lennart mentioning that this was the size
> of the block allocated in PA, and beyond this you would use malloc().

Note that minreq is just the threshold at which pulse will ask for more data. You are free to send whatever amount is writable when you have data ready; it can be smaller or larger than minreq (pulsesink does exactly that).

I don't know how malloc comes into play here. I just know that it makes technically no sense to write buffers larger than 64K to pulse: the client library chops them down to 64K chunks because that is the internal size limit. That is, the IPC overhead of sending two 64K buffers vs. one 128K buffer is exactly the same.

> Besides, it seems to me that the total latency is really defined by
> tlength; if you increase minreq, the size of the server buffer will be
> adjusted. See Lennart's page at
> http://pulseaudio.org/wiki/LatencyControl: latency is defined with
> tlength, and minreq has no direct impact on latency.
> And as I mentioned, the patch doesn't change the overhead, since we
> keep writing the same size no matter what minreq was set to.

Yes indeed, in fact the patch gives next to no CPU load improvement. However, it leads to the writes from gst to pa being grouped together, with larger intervals of inactivity in between (tunable with the latency-time property). This grouping results in improved power management. On the N900 I measured a penalty of 10% in energy consumption without the patch applied (for MP3 on wired headset, display off, i.e. the typical long-term playback use-case).

>> During the investigation of that regression, I found that there are some further
>> things to optimize in pulsesink. I will be filing more bugs and sending more
>> patches as I come up with better solutions.
>
> Will send you my code.
>
>> For the time being, I think you can get almost the same performance/battery
>> life gain by increasing the output buffer size of your audio decoders. Felipe
>> Contreras has been trying this with the vorbis decoder, with good results.
>
> That's not necessarily an option. There are 3rd party decoders out
> there whose code is not necessarily public. And fixing the decoders is
> somewhat odd when the real problem is the sink...
> Cheers
> -Pierre

The sink is not perfect, but the decoder situation also needs work. Current decoders choose the output buffer sizes themselves, and this is wrong. Yes, you could change the sink and stitch these buffers together using pad_alloc, but the fact remains that the decoder picks the size and therefore decides on the overhead up to the sink (and on all processing elements between decoder and sink).

This became apparent to me when Felipe profiled Ogg Vorbis playback with a highly optimized decoder (ffmpeg). Basically the CPU spends an insane amount of time pushing GStreamer buffers around compared to the actual audio decoding. And this on the N900, which shows exactly that the current situation is complete nonsense for a battery-powered device.

--
Regards,
René Stadler
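[Editor's note: for reference, tuning that grouping from a pipeline might look something like the line below (untested sketch, 0.10-era element names, buffer-time and latency-time are in microseconds and the values here are only for illustration):]

  gst-launch-0.10 filesrc location=test.mp3 ! decodebin ! audioconvert ! audioresample ! \
      pulsesink buffer-time=2000000 latency-time=200000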
On Thu, Oct 15, 2009 at 3:18 AM, René Stadler <[hidden email]> wrote:
> pl bossart wrote:
>> That's not necessarily an option. There are 3rd party decoders out
>> there whose code is not necessarily public. And fixing the decoders is
>> somewhat odd when the real problem is the sink...
>
> The sink is not perfect, but the decoder situation also needs work. Current
> decoders choose the output buffer sizes themselves, and this is wrong. Yes,
> you could change the sink and stitch these buffers together using pad_alloc,
> but the fact remains that the decoder picks the size and therefore decides
> on the overhead up to the sink (and on all processing elements between decoder
> and sink).
>
> This became apparent to me when Felipe profiled Ogg Vorbis playback with a
> highly optimized decoder (ffmpeg). Basically the CPU spends an insane amount
> of time pushing GStreamer buffers around compared to the actual audio decoding.
> And this on the N900, which shows exactly that the current situation is
> complete nonsense for a battery-powered device.

Indeed. I profiled the audio pipeline and 30% of the time was spent in the decoder; the rest was spent pushing buffers around. When I increased the buffer size pushed by the decoder (128k), efficiency improved: the decoder now accounts for 45% of the time. But still, 55% of CPU time spent pushing buffers around is unacceptable.

Profiling what happens on the pulseaudio side is an exercise I haven't done yet, but as René said, my guess is that there's some ideal buffer size that pulseaudio would like to receive from the decoder, one that is big enough for GStreamer not to choke on it.

Removing the queue between the decoder and the sink would also help to avoid the unnecessary overhead of mutex contention, especially on small buffers.

Ideally I guess the sink should be able to receive small buffers without a performance penalty, but currently that doesn't seem to be the case.

Cheers.

--
Felipe Contreras
On Thu, 2009-10-15 at 13:01 +0300, Felipe Contreras wrote:
> On Thu, Oct 15, 2009 at 3:18 AM, René Stadler <[hidden email]> wrote:
>> pl bossart wrote:
>>> That's not necessarily an option. There are 3rd party decoders out
>>> there whose code is not necessarily public. And fixing the decoders is
>>> somewhat odd when the real problem is the sink...
>>
>> The sink is not perfect, but the decoder situation also needs work. Current
>> decoders choose the output buffer sizes themselves, and this is wrong. Yes,
>> you could change the sink and stitch these buffers together using pad_alloc,
>> but the fact remains that the decoder picks the size and therefore decides
>> on the overhead up to the sink (and on all processing elements between decoder
>> and sink).
>>
>> This became apparent to me when Felipe profiled Ogg Vorbis playback with a
>> highly optimized decoder (ffmpeg). Basically the CPU spends an insane amount
>> of time pushing GStreamer buffers around compared to the actual audio decoding.
>> And this on the N900, which shows exactly that the current situation is
>> complete nonsense for a battery-powered device.
>
> Indeed. I profiled the audio pipeline and 30% of the time was spent in
> the decoder; the rest was spent pushing buffers around. When I
> increased the buffer size pushed by the decoder (128k), efficiency
> improved: the decoder now accounts for 45% of the time. But still, 55%
> of CPU time spent pushing buffers around is unacceptable.

It's now also possible to push multiple buffers at the same time by using buffer lists. I don't know if that solves anything here; I guess it depends on the number of encoded frames that a decoder receives.

We could for example change the ogg demuxer to push all packets in a page in a bufferlist and make the vorbis decoder decode the complete list before pushing the list of samples to the sink. This would reduce the number of objects that get pushed around between elements.

Also, pushing buffers should not be as expensive as 55%. I don't know what's happening there: maybe the gobject allocation is what's slowing it down (patches exist for glib)? Maybe the locking/refcounting is slow (these should be simple atomic operations on the N900)? Maybe the typechecking (patches also exist for glib)? It would be nice to know what's causing this overhead in more detail.

With newer pulse we can't really use buffer_alloc to write directly into the pulse shared memory from the decoder, because there is no API to allocate such a chunk; there is pa_stream_begin_write(), but that can only be called once AFAIK.

Wim

> Profiling what happens on the pulseaudio side is an exercise I haven't
> done yet, but as René said, my guess is that there's some ideal buffer
> size that pulseaudio would like to receive from the decoder, one that
> is big enough for GStreamer not to choke on it.
>
> Removing the queue between the decoder and the sink would also help to
> avoid the unnecessary overhead of mutex contention, especially on small
> buffers.
>
> Ideally I guess the sink should be able to receive small buffers
> without a performance penalty, but currently that doesn't seem to be
> the case.
>
> Cheers.
In reply to this post by René Stadler
> Note that minreq is just the threshold at which pulse will ask for more data. You
> are free to send whatever amount is writable when you have data ready; it can
> be smaller or larger than minreq (pulsesink does exactly that).

<snip>

> Yes indeed, in fact the patch gives next to no CPU load improvement. However,
> it leads to the writes from gst to pa being grouped together, with larger
> intervals of inactivity in between (tunable with the latency-time property).
> This grouping results in improved power management. On the N900 I measured a
> penalty of 10% in energy consumption without the patch applied (for MP3 on
> wired headset, display off, i.e. the typical long-term playback use-case).

My point was that the buffer-time property is used to set the audio latency, while the latency-time property doesn't set any latency, only the granularity of the GStreamer processing. This is not exactly self-explanatory without knowing in detail how PulseAudio works. To be more consistent, we should rename these properties. In PulseAudio's pacat, the options are called --latency and --process-time, which is a lot more intuitive than the current GStreamer options.

While I don't have detailed profiling data, I concur with Felipe's observations. I have a gstreamer-based audio player running at 8-9% CPU while a stand-alone player using the same decoding engine needs 5% in the same conditions (no UI, etc). That's a lot of overhead...
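[Editor's note: roughly how the two property names end up mapping onto PA's buffer metrics, as described in this thread — a sketch only, not the actual pulsesink code, and the helper name is made up:]

  #include <pulse/pulseaudio.h>

  /* buffer-time maps to tlength (this is what bounds latency), while
   * latency-time maps to minreq (this only sets the request granularity).
   * Both GStreamer properties are in microseconds. */
  static pa_buffer_attr attr_from_props(pa_usec_t buffer_time_us,
                                        pa_usec_t latency_time_us,
                                        const pa_sample_spec *ss)
  {
      pa_buffer_attr attr;

      attr.maxlength = (uint32_t) -1;                                  /* let the server pick  */
      attr.tlength   = (uint32_t) pa_usec_to_bytes(buffer_time_us, ss);  /* total latency      */
      attr.prebuf    = (uint32_t) -1;
      attr.minreq    = (uint32_t) pa_usec_to_bytes(latency_time_us, ss); /* request granularity */
      attr.fragsize  = (uint32_t) -1;                                  /* record-only, unused  */

      return attr;
  }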
In reply to this post by Wim Taymans
On Thu, Oct 15, 2009 at 1:58 PM, Wim Taymans <[hidden email]> wrote:
> On Thu, 2009-10-15 at 13:01 +0300, Felipe Contreras wrote:
>> On Thu, Oct 15, 2009 at 3:18 AM, René Stadler <[hidden email]> wrote:
>>> pl bossart wrote:
>>>> That's not necessarily an option. There are 3rd party decoders out
>>>> there whose code is not necessarily public. And fixing the decoders is
>>>> somewhat odd when the real problem is the sink...
>>>
>>> The sink is not perfect, but the decoder situation also needs work. Current
>>> decoders choose the output buffer sizes themselves, and this is wrong. Yes,
>>> you could change the sink and stitch these buffers together using pad_alloc,
>>> but the fact remains that the decoder picks the size and therefore decides
>>> on the overhead up to the sink (and on all processing elements between decoder
>>> and sink).
>>>
>>> This became apparent to me when Felipe profiled Ogg Vorbis playback with a
>>> highly optimized decoder (ffmpeg). Basically the CPU spends an insane amount
>>> of time pushing GStreamer buffers around compared to the actual audio decoding.
>>> And this on the N900, which shows exactly that the current situation is
>>> complete nonsense for a battery-powered device.
>>
>> Indeed. I profiled the audio pipeline and 30% of the time was spent in
>> the decoder; the rest was spent pushing buffers around. When I
>> increased the buffer size pushed by the decoder (128k), efficiency
>> improved: the decoder now accounts for 45% of the time. But still, 55%
>> of CPU time spent pushing buffers around is unacceptable.
>
> It's now also possible to push multiple buffers at the same time by
> using buffer lists. I don't know if that solves anything here; I
> guess it depends on the number of encoded frames that a decoder
> receives.
>
> We could for example change the ogg demuxer to push all packets in a
> page in a bufferlist and make the vorbis decoder decode the complete list
> before pushing the list of samples to the sink. This would reduce the
> number of objects that get pushed around between elements.

I don't think that would help. What we need is to push big buffers to pulsesink, regardless of what we receive on the input. It's not a big problem for the decoder to fill some temporary buffer.

> Also, pushing buffers should not be as expensive as 55%. I don't know
> what's happening there: maybe the gobject allocation is what's slowing
> it down (patches exist for glib)? Maybe the locking/refcounting is slow
> (these should be simple atomic operations on the N900)? Maybe the
> typechecking (patches also exist for glib)? It would be nice to know
> what's causing this overhead in more detail.

It seems to me that allocating and unreffing buffers is taking much more than expected:

http://people.freedesktop.org/~felipec/profile/mp3-1.png

> With newer pulse we can't really use buffer_alloc to write directly into
> the pulse shared memory from the decoder, because there is no API to
> allocate such a chunk; there is pa_stream_begin_write(), but that can
> only be called once AFAIK.

I think memcpy is the least grave of the problems right now.

--
Felipe Contreras
>> We could for example change the ogg demuxer to push all packets in a
>> page in a bufferlist and make the vorbis decoder decode the complete list
>> before pushing the list of samples to the sink. This would reduce the
>> number of objects that get pushed around between elements.
>
> I don't think that would help. What we need is to push big buffers to
> pulsesink, regardless of what we receive on the input. It's not a big
> problem for the decoder to fill some temporary buffer.

It does not help cpu- or power-wise to push buffers larger than 64k into PulseAudio, so the 'big' buffers would be limited to ~370 ms, or ~14 decoded MP3 frames.

Given this upper bound, how would the decoder know how big the buffers should really be? If you somehow don't provide the latency information from the sink back to the decoder, the decoder is going to make arbitrary decisions no matter what context it is used in. If you are doing audio only, using large buffers is no issue, but if you are using the same decoder with video active, you may want to avoid too-large buffers.
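[Editor's note: the arithmetic behind those figures, assuming 44.1 kHz stereo 16-bit PCM (176400 bytes/s) and MP3's 1152 samples per frame:]

  one decoded MP3 frame = 1152 samples x 2 channels x 2 bytes = 4608 bytes
  65536 bytes / 4608 bytes per frame  ~= 14 frames
  65536 bytes / 176400 bytes per second ~= 0.37 s (~370 ms)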
On Fri, Oct 16, 2009 at 12:21 AM, pl bossart <[hidden email]> wrote:
>>> We could for example change the ogg demuxer to push all packets in a
>>> page in a bufferlist and make the vorbis decoder decode the complete list
>>> before pushing the list of samples to the sink. This would reduce the
>>> number of objects that get pushed around between elements.
>>
>> I don't think that would help. What we need is to push big buffers to
>> pulsesink, regardless of what we receive on the input. It's not a big
>> problem for the decoder to fill some temporary buffer.
>
> It does not help cpu- or power-wise to push buffers larger than 64k
> into PulseAudio, so the 'big' buffers would be limited to ~370 ms, or
> ~14 decoded MP3 frames.

Currently playbin2 adds a queue between the decoder and the sink, so they run in different threads, and the thread synchronization will result in more overhead with smaller buffers (not to mention buffer creation/destruction). So from the GStreamer point of view, there's a direct relationship between buffer size and CPU overhead.

I'm not sure if "too big" buffers would actually impact PA negatively, but my guess is there's a limit to how big the buffer should be, and that limit would actually be the ideal size.

> Given this upper bound, how would the decoder know how big the buffers
> should really be? If you somehow don't provide the latency information
> from the sink back to the decoder, the decoder is going to make
> arbitrary decisions no matter what context it is used in. If you are
> doing audio only, using large buffers is no issue, but if you are
> using the same decoder with video active, you may want to avoid
> too-large buffers.

That is true. I haven't thought about the video case; from my point of view audio-only playback is already too screwed up to think about that. What's the worst that could happen? A temporary A/V synchronization mismatch when the video decoder lags behind?

--
Felipe Contreras
In reply to this post by pl bossart
On Wed, 14.10.09 14:44, pl bossart ([hidden email]) wrote:
> Hi folks,
> I noticed performance issues due to the rewrite of pulsesink since the
> 0.10.15 release. The degradation is in the 30% range on my Atom board
> when playing MP3/AAC. There have been a couple of modifications in git
> related to buffer attributes and latency settings, but overall the
> overhead remains, and the pulsesink code could be further optimized
> for low-power playback apps that don't care about latency.
>
> I finally took the time to look at the code and check what was going
> on. It seems that the overhead is mainly due to the granularity of
> transfers between pulsesink and PulseAudio. What happens is that the
> sink waits for space to become available in the PulseAudio buffer. When PA
> requests data in a callback, the mainloop unblocks and the sink writes
> its PCM to PulseAudio. The problem is that the sink will not try to
> fill the whole buffer before handing the data off to PulseAudio. For
> example, say PulseAudio requests 100k (as defined by minreq) and you
> are doing MP3 decode: you are going to send one frame (4608 bytes) at
> a time to PulseAudio until the 100k have been filled. That's a lot of
> overhead. It would be a lot more efficient power-wise to decode and
> store as many frames as possible into the PA buffer before calling
> pa_stream_write().

This is mostly correct. But actually finding the right buffer sizes to send to PA is a science of its own. If you have to fill a 2s buffer, and you calculate the audio for that all in one step and send it in one packet to PA, then you might have to do some CPU-intensive work for quite some time (e.g. decoding AC3), during which PA might run out of data to play. Which might become a problem. So the general rule is to send packets as big as possible, but not to block on that for too long. This is of course a very imprecise definition.

Also, for optimizing the data transfer via SHM you shouldn't use memory blocks larger than 64k right now (actually a little less), which is the SHM tile size. I probably should export that value in libpulse in some way, so that clients can optimize for it and pass blocks of size MIN(pa_stream_get_writable_size(), pa_context_get_tile_size()) or so. I'll add that in the next release. And I think that block size would be a good value to optimize the writes for. Unless one starts counting CPU cycles, finding the perfect block size is not possible anyway.

> I have snippets of code as a proof of concept. I don't mind releasing
> the code, but I must admit this is a hack and does not cover all the
> cases pulsesink addresses. An additional optimization could consist of
> passing the PulseAudio buffer upstream to avoid memory copies. The new
> PA release provides support for this with pa_stream_begin_write(). In
> short, I would badly need a review from more experienced developers...
> If anyone is interested let me know.

In fact I added the _begin_write() stuff specifically for use in GStreamer, after a talk the Gst folks and I had at last FOSDEM.

Lennart

--
Lennart Poettering                        Red Hat, Inc.
lennart [at] poettering [dot] net
http://0pointer.net/lennart/              GnuPG 0x1A015CC4
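[Editor's note: a minimal sketch of the chunking rule described above — never write more than the SHM tile size or the currently writable amount. The 64k constant is an assumption standing in for the tile-size API proposed above; error handling and clocking are omitted.]

  #include <stdint.h>
  #include <pulse/pulseaudio.h>

  #define SHM_TILE_SIZE (64 * 1024)  /* assumed upper bound; the proposed
                                        pa_context_get_tile_size() would
                                        replace this hard-coded value */

  /* Push a large decoded block to PA in chunks no bigger than the SHM tile
   * size or the writable space. 's' is a connected playback stream; the
   * caller runs in the PA mainloop. */
  static void push_block(pa_stream *s, const uint8_t *data, size_t len)
  {
      while (len > 0) {
          size_t writable = pa_stream_writable_size(s);
          size_t chunk = len;

          if (writable == (size_t) -1)
              break;                      /* stream error                    */
          if (chunk > writable)
              chunk = writable;
          if (chunk > SHM_TILE_SIZE)
              chunk = SHM_TILE_SIZE;
          if (chunk == 0)
              break;                      /* buffer full; wait for next request */

          if (pa_stream_write(s, data, chunk, NULL, 0, PA_SEEK_RELATIVE) < 0)
              break;

          data += chunk;
          len  -= chunk;
      }
  }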
In reply to this post by René Stadler
On Thu, 15.10.09 00:48, René Stadler ([hidden email]) wrote:
>> I finally took the time to look at the code and check what was going
>> on. It seems that the overhead is mainly due to the granularity of
>> transfers between pulsesink and PulseAudio. What happens is that the
>> sink waits for space to become available in the PulseAudio buffer. When PA
>> requests data in a callback, the mainloop unblocks and the sink writes
>> its PCM to PulseAudio. The problem is that the sink will not try to
>> fill the whole buffer before handing the data off to PulseAudio. For
>> example, say PulseAudio requests 100k (as defined by minreq) and you
>> are doing MP3 decode: you are going to send one frame (4608 bytes) at
>> a time to PulseAudio until the 100k have been filled. That's a lot of
>> overhead. It would be a lot more efficient power-wise to decode and
>> store as many frames as possible into the PA buffer before calling
>> pa_stream_write().
>
> Wim just committed my patch that changes pulsesink back to setting
> minreq to the value of the latency-time property, which lets
> applications tune the gst<->pa overhead again.

In this context: a few days ago I wrote up this wiki page, which tries to explain how to configure latency properly for PA streams:

http://pulseaudio.org/wiki/LatencyControl

> During the investigation of that regression, I found that there are
> some further things to optimize in pulsesink. I will be filing more
> bugs and sending more patches as I come up with better solutions.

BTW, any chance I could be subscribed automatically to all bugs regarding pulsesink? Does anyone know if GNOME bz can do that?

Lennart

--
Lennart Poettering                        Red Hat, Inc.
lennart [at] poettering [dot] net
http://0pointer.net/lennart/              GnuPG 0x1A015CC4
In reply to this post by pl bossart
On Wed, 14.10.09 18:13, pl bossart ([hidden email]) wrote:
> Howdy René,
>
>> Wim just committed my patch that changes pulsesink back to setting minreq to
>> the value of the latency-time property, which lets applications tune the
>> gst<->pa overhead again.
>
> Hmm, my experiments show that the CPU activity increases when minreq
> is > 64k. I sort of remember Lennart mentioning that this was the size
> of the block allocated in PA, and beyond this you would use malloc().

Yes, that is true (as mentioned in the response I sent 5 minutes ago; I probably should have read the full thread before responding...)

The story goes like this: if you call pa_stream_write() with memory blocks < 64k, then PA will be able to place your entire data in a single SHM tile and can send the whole thing in one step to the other side. If you pick larger sizes, then PA cannot optimize things that way and might end up sending the data via the socket instead of SHM.

Lennart

--
Lennart Poettering                        Red Hat, Inc.
lennart [at] poettering [dot] net
http://0pointer.net/lennart/              GnuPG 0x1A015CC4
In reply to this post by René Stadler
On Thu, 15.10.09 03:18, René Stadler ([hidden email]) wrote:
>> Besides, it seems to me that the total latency is really defined by
>> tlength; if you increase minreq, the size of the server buffer will be
>> adjusted. See Lennart's page at
>> http://pulseaudio.org/wiki/LatencyControl: latency is defined with
>> tlength, and minreq has no direct impact on latency.
>> And as I mentioned, the patch doesn't change the overhead, since we
>> keep writing the same size no matter what minreq was set to.
>
> Yes indeed, in fact the patch gives next to no CPU load improvement.
> However, it leads to the writes from gst to pa being grouped
> together, with larger intervals of inactivity in between (tunable
> with the latency-time property). This grouping results in improved
> power management. On the N900 I measured a penalty of 10% in energy
> consumption without the patch applied (for MP3 on wired headset,
> display off, i.e. the typical long-term playback use-case).

Hmm, I wonder if I should formalize that in the PA API, i.e. provide something that would allow the app to officially declare when one of those packet "bursts" starts and when it ends. Something like this:

  pa_stream_begin_write_burst();
  pa_stream_write(...);
  pa_stream_write(...);
  pa_stream_write(...);
  pa_stream_write(...);
  pa_stream_end_write_burst();

And then add a couple of optimizations internally that would already flush the buffers before the burst is over, according to some wallclock timeout or a full SHM tile or so.

Or maybe that is too complex. Dunno.

Lennart

--
Lennart Poettering                        Red Hat, Inc.
lennart [at] poettering [dot] net
http://0pointer.net/lennart/              GnuPG 0x1A015CC4
In reply to this post by Wim Taymans
On Thu, 15.10.09 12:58, Wim Taymans ([hidden email]) wrote:
> With newer pulse we can't really use buffer_alloc to write directly into
> the pulse shared memory from the decoder, because there is no API to
> allocate such a chunk; there is pa_stream_begin_write(), but that can
> only be called once AFAIK.

Hmm, so you'd like to see some API in PA that allows you to allocate, ref and unref multiple memory blocks at the same time and independently of each other? Internally that's what happens anyway, so we could definitely add that.

The main reason why I didn't expose this directly is that it doesn't mix well with allowing the user to pass subsets of the allocated buffers to pa_stream_write(), unless we add a completely new function pa_stream_write_preallocated() that takes the buffer pointer, an index into it and a length. Which would be kinda clumsy, wouldn't it?

Also, I am a bit afraid of how the semantics of PA's and Gst's buffer handling might collide there. In PA, memory blocks can actually change location in memory at any time. To lock them into an accessible place for a while, you have to call an _acquire() function first and afterwards a _release() function, and in between your code should not do communication with other threads or other 'slow' stuff. This would not mix well with Gst's own buffer handling, would it? The current PA API makes clear, in a way, that the buffer should be allocated using pa_stream_begin_write() only very shortly before the actual _write() (or at least it was my intention to make that clear). But if I export the whole memory block allocation API, then this would be really hard to express in the API in such a way that people whose memory allocation scheme is incompatible with this would not simply call _acquire() right away and then delay the _release() until the very last moment.

The scheme that is now part of the PA API is also designed to be somewhat similar to how ALSA's mmap() API for playback works: when you want to write your data you ask for a pointer, then push your data into it, and then commit it. I kind of assumed that Gst would gain support for alsa mmap eventually, and it would make sense to follow the same scheme. What's the plan with that?

Hmm, if all this zero-copy stuff is being rethought for gst, maybe keeping alsa mmap io in mind might be a good idea. Of the various alsa features, the mmap io stuff is actually one that didn't trigger many problems when PA started to make heavy use of it. So I can only encourage its use ;-)

Lennart

--
Lennart Poettering                        Red Hat, Inc.
lennart [at] poettering [dot] net
http://0pointer.net/lennart/              GnuPG 0x1A015CC4
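[Editor's note: a sketch of the "ask for a pointer, fill it, commit" pattern described above, using ALSA's mmap API for interleaved S16 stereo. fill_samples() is a made-up placeholder for the decoder; xrun handling (snd_pcm_recover) is omitted for brevity.]

  #include <stdint.h>
  #include <alsa/asoundlib.h>

  /* Hypothetical decoder hook: fills 'frames' frames (2 interleaved
   * int16 samples per frame) starting at 'dst'. */
  extern void fill_samples(int16_t *dst, snd_pcm_uframes_t frames);

  static void mmap_write_some(snd_pcm_t *pcm)
  {
      const snd_pcm_channel_area_t *areas;
      snd_pcm_uframes_t offset, frames;
      snd_pcm_sframes_t avail;

      avail = snd_pcm_avail_update(pcm);        /* how much space is free   */
      if (avail <= 0)
          return;
      frames = (snd_pcm_uframes_t) avail;

      /* Ask for a pointer into the ring buffer... */
      if (snd_pcm_mmap_begin(pcm, &areas, &offset, &frames) < 0)
          return;

      /* ...push the data into it (interleaved layout assumed)... */
      int16_t *dst = (int16_t *) ((uint8_t *) areas[0].addr +
                                  (areas[0].first + offset * areas[0].step) / 8);
      fill_samples(dst, frames);

      /* ...and commit what was actually written. */
      snd_pcm_mmap_commit(pcm, offset, frames);
  }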