pulsesink optimizations

pulsesink optimizations

pl bossart
Hi folks,
I noticed performance issues due to the rewrite of pulsesink since the
0.10.15 release. The degradation is in the 30% range on my Atom board
when playing MP3/AAC. There have been a couple of modifications in git
related to buffer attributes and latency settings, but overall the
overhead remains, and the pulsesink code could be further optimized
for low-power playback apps that don't care about latency.

I finally took the time to look at the code and check what was going
on. It seems that the overhead is mainly due to the granularity of
transfers between pulsesink and PulseAudio. What happens is that the
sink waits for space available in the PulseAudio buffer. When PA
requests data in a callback, the mainloop unblocks and the sink writes
its PCM to PulseAudio. The problem is that the sink will not try to
fill the whole buffer before handing-off the data to PulseAudio. For
example, say PulseAudio requests 100k (as defined by minreq) and you
are doing MP3 decode, you are going to send one frame (4608 bytes) at
a time to PulseAudio until the 100k have been filled. That's a lot of
overhead. It would be a lot more efficient power-wise to decode and
store as many frames as possible into the PA buffer before calling
pa_stream_write().
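
To put a rough number on the granularity problem described above, here is a minimal sketch in Python. It only models the arithmetic and is not pulsesink code; `writes_per_request` is a hypothetical helper, and reading "100k" as 100 * 1024 bytes is an assumption.

```python
import math

MP3_FRAME_BYTES = 4608  # one decoded MP3 frame, as quoted in the post


def writes_per_request(request_bytes: int, write_bytes: int) -> int:
    """Number of write calls needed to satisfy one request from the server."""
    if request_bytes <= 0 or write_bytes <= 0:
        raise ValueError("sizes must be positive")
    return math.ceil(request_bytes / write_bytes)


request = 100 * 1024  # assumption: "100k" means 100 KiB

# One write per decoded frame (the behaviour described above).
per_frame = writes_per_request(request, MP3_FRAME_BYTES)

# One write after accumulating enough frames to cover the whole request.
batched = writes_per_request(request, request)

print(per_frame, batched)
```

Under these assumptions, one request costs a couple of dozen small writes instead of a single large one, which is the overhead the post is about.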

I have snippets of code as a proof of concept. I don't mind releasing
the code, but I must admit this is a hack and does not cover all the
cases pulsesink addresses. An additional optimization could consist in
passing the PulseAudio buffer upstream to avoid memory copies. The new
PA release provides support for this with pa_stream_begin_write(). In
short, I would badly need a review from more experienced developers...
If anyone is interested let me know.

Cheers,
- Pierre

------------------------------------------------------------------------------
Come build with us! The BlackBerry(R) Developer Conference in SF, CA
is the only developer event you need to attend this year. Jumpstart your
developing skills, take BlackBerry mobile applications to market and stay
ahead of the curve. Join us from November 9 - 12, 2009. Register now!
http://p.sf.net/sfu/devconference
_______________________________________________
gstreamer-devel mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gstreamer-devel

Re: pulsesink optimizations

René Stadler
pl bossart wrote:
> Hi folks,
> I noticed performance issues due to the rewrite of pulsesink since the
> 0.10.15 release. The degradation is in the 30% range on my Atom board
> when playing MP3/AAC. There have been a couple of modifications in git
> related to buffer attributes and latency settings, but overall the
> overhead remains, and the pulsesink code could be further optimized
> for low-power playback apps that don't care about latency.

I noticed the same on the Nokia N900.

> I finally took the time to look at the code and check what was going
> on. It seems that the overhead is mainly due to the granularity of
> transfers between pulsesink and PulseAudio. What happens is that the
> sink waits for space available in the PulseAudio buffer. When PA
> requests data in a callback, the mainloop unblocks and the sink writes
> its PCM to PulseAudio. The problem is that the sink will not try to
> fill the whole buffer before handing-off the data to PulseAudio. For
> example, say PulseAudio requests 100k (as defined by minreq) and you
> are doing MP3 decode, you are going to send one frame (4608 bytes) at
> a time to PulseAudio until the 100k have been filled. That's a lot of
> overhead. It would be a lot more efficient power-wise to decode and
> store as many frames as possible into the PA buffer before calling
> pa_stream_write().

Wim just committed my patch that changes pulsesink back to set the minreq to
the value of the latency-time property, which lets applications tune the
gst<->pa overhead again.

During the investigation of that regression, I found that there are some further
things to optimize in pulsesink. I will be filing more bugs and sending more
patches as I come up with better solutions.

> I have snippets of code as a proof of concept. I don't mind releasing
> the code, but I must admit this is a hack and does not cover all the
> cases pulsesink addresses. An additional optimization could consist in
> passing the PulseAudio buffer upstream to avoid memory copies. The new
> PA release provides support for this with pa_stream_begin_write(). In
> short, I would badly need a review from more experienced developers...
> If anyone is interested let me know.
>
> Cheers,
> - Pierre

Using that API is a step in the right direction. However, there is still a lot
to do. GStreamer desperately needs a zero-copy mechanism for audio such that the
audio decoders' output buffer sizing doesn't incur arbitrary overhead.

For the time being, I think you can get almost the same performance/battery
life gain by increasing the output buffer size of your audio decoders. Felipe
Contreras has been trying this with the vorbis decoder, with good results.

--
Regards,
   René Stadler


Re: pulsesink optimizations

pl bossart
Howdy Rene',

> Wim just committed my patch that changes pulsesink back to set the minreq to
> the value of the latency-time property, which lets applications tune the
> gst<->pa overhead again.

Humm, my experiments show that the core activity increases when minreq
is > 64k. I sort of remember Lennart mentioning that this was the size
of the block allocated in PA, and beyond this you would use malloc().
Besides, it seems to me that the total latency is really defined by
tlength; if you increase minreq, the size of the server buffer will be
adjusted. See Lennart's page at
http://pulseaudio.org/wiki/LatencyControl: latency is defined with
tlength, and minreq has no direct impact on latency.
And as I mentioned, the patch doesn't change the overhead since we
keep writing the same size no matter what minreq was set to.

> During the investigation of that regression, I found that there is some further
> things to optimize in pulsesink. I will be filing more bugs and sending more
> patches as I come up with better solutions.

Will send you my code.

> For the time being, I think you can get almost the same performance/battery
> life gain by increasing the output buffer size of your audio decoders. Felipe
> Contreras has been trying this with the vorbis decoder, with good results.

That's not necessarily an option. There are 3rd party decoders out
there whose code is not necessarily public. And fixing the decoders is
somewhat odd when the real problem is the sink...
Cheers
-Pierre


Re: pulsesink optimizations

René Stadler
pl bossart wrote:
> Howdy Rene',
>
>> Wim just committed my patch that changes pulsesink back to set the minreq to
>> the value of the latency-time property, which lets applications tune the
>> gst<->pa overhead again.
>
> Humm, my experiments show that the core activity increases when minreq
> is > 64k. I sort of remember Lennart mentioning that this was the size
> of the block allocated in PA, and beyond this you would use malloc().

Note that minreq is just the threshold when pulse will ask for more data. You
are free to send whatever amount is writable when you have data ready, it can
be smaller or larger than minreq (pulsesink does exactly that).

I don't know how malloc comes into play here. I just know that it makes
no technical sense to write buffers larger than 64K to pulse: the client
library chops them down to 64K chunks because that is the internal size limit.
That is, the IPC overhead of sending two 64K buffers vs. one 128K buffer is
exactly the same.
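
The chunking argument above can be sketched with a small model. This is only an illustration in Python, not the client library's code: `chunks_sent` is a hypothetical helper, and the limit is assumed to be exactly 64 KiB.

```python
import math

CHUNK_LIMIT = 64 * 1024  # internal size limit mentioned above (assumed exactly 64 KiB)


def chunks_sent(write_sizes):
    """Total number of chunks crossing the IPC boundary for a series of writes."""
    return sum(math.ceil(size / CHUNK_LIMIT) for size in write_sizes if size > 0)


two_small = chunks_sent([64 * 1024, 64 * 1024])  # two 64K writes
one_large = chunks_sent([128 * 1024])            # one 128K write, split by the library
print(two_small, one_large)
```

Both cases produce the same number of chunks in this model, so writing more than the limit in one call buys nothing on the IPC side.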

> Besides, it seems to me that the total latency is really defined by
> tlength, if you increase minreq the size of the server buffer will be
> adjusted. See Lennart's page at
> http://pulseaudio.org/wiki/LatencyControl, latency is defined with
> tlength, minreq has no direct impact on latency.
> And as I mentioned it, the patch doesn't change the overhead since we
> keep writing the same size no matter what minreq was set to.

Yes indeed, in fact the patch gives next to no CPU load improvement. However,
it leads to the writes from gst to pa being grouped together with larger
intervals of inactivity in between (tunable with the latency-time property).
This grouping results in improved power management. On the N900 I
measured a penalty of 10% in energy consumption without the patch applied (for
MP3 on wired headset, display off, i.e. typical long-term playback use case).

>> During the investigation of that regression, I found that there is some further
>> things to optimize in pulsesink. I will be filing more bugs and sending more
>> patches as I come up with better solutions.
>
> Will send you my code.
>
>> For the time being, I think you can get almost the same performance/battery
>> life gain by increasing the output buffer size of your audio decoders. Felipe
>> Contreras has been trying this with the vorbis decoder, with good results.
>
> That's not necessarily an option. There are 3rd party decoders out
> there whose code is not necessarily public. And fixing the decoders is
> somewhat odd when the real problem is the sink...
> Cheers
> -Pierre

The sink is not perfect, but the decoder situation also needs work. Current
decoders choose the output buffer sizes themselves, and this is wrong. Yes, you
could change the sink and stitch these buffers together using pad_alloc, but
the fact remains that the decoder picks the size and therefore decides on the
overhead up to the sink (and all processing elements between decoder and sink).

This became apparent to me when Felipe profiled OggVorbis playback with a
highly optimized decoder (ffmpeg). Basically the CPU spends an insane amount of
time pushing GStreamer buffers around compared to the actual audio decoding. And
this is on the N900, which shows exactly that the current situation is complete
nonsense for a battery-powered device.

--
Regards,
   René Stadler


Re: pulsesink optimizations

Felipe Contreras
On Thu, Oct 15, 2009 at 3:18 AM, René Stadler <[hidden email]> wrote:

> pl bossart wrote:
>> That's not necessarily an option. There are 3rd party decoders out
>> there whose code is not necessarily public. And fixing the decoders is
>> somewhat odd when the real problem is the sink...
>
> The sink is not perfect, but the decoder situation also need work. Current
> decoders chose the output buffer sizes themselves, and this is wrong. Yes
> you could change the sink and stitch these buffers together using pad_alloc,
> but the fact remains that the decoder picks the size and therefore decides
> on the overhead up to the sink (and all processing elements between decoder
> and sink).
>
> This became apparent to me when Felipe profiled OggVorbis playback with a
> highly optimized decoder (ffmpeg). Basically the CPU spends an insane amount
> of time pushing GStreamer buffers around compared the actual audio decoding.
> And this on the N900, which shows exactly that the current situation is
> complete nonsense for a battery-powered device.

Indeed. I profiled the audio pipeline and 30% of the time was spent in
the decoder; the rest was spent pushing buffers around. When I
increased the size of the buffers pushed by the decoder (128k), efficiency
improved: the decoder's share of the time is now 45%. But still, 55% of
CPU time spent pushing buffers around is unacceptable.
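
As a rough way to read those percentages, the following Python sketch converts a decoder share into "overhead per unit of decoding". It assumes the decoding work itself is the same in both runs, which the profile does not prove; `overhead_per_decode_unit` is a hypothetical helper.

```python
def overhead_per_decode_unit(decoder_share: float) -> float:
    """CPU time spent outside the decoder for each unit of time spent decoding."""
    if not 0.0 < decoder_share <= 1.0:
        raise ValueError("decoder_share must be in (0, 1]")
    return (1.0 - decoder_share) / decoder_share


before = overhead_per_decode_unit(0.30)  # small decoder output buffers
after = overhead_per_decode_unit(0.45)   # 128k decoder output buffers
print(round(before, 2), round(after, 2))
```

Under that assumption, the non-decoding cost drops from a bit over two units per unit of decoding to a bit over one, which is better but still more than the decoding itself.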

Profiling what happens on pulseaudio side is an exercise I haven't
done yet, but as René said, my guess is that there's some ideal buffer
size that pulseaudio would like to receive from the decoder that's big
enough for GStreamer not to choke on it.

Removing the queue between the decoder and the sink would also help to
avoid the unnecessary overhead of mutex contention, especially with small
buffers.

Ideally I guess the sink should be able to receive small buffers
without a performance penalty, but currently that doesn't seem to be the
case.

Cheers.

--
Felipe Contreras


Re: pulsesink optimizations

Wim Taymans
On Thu, 2009-10-15 at 13:01 +0300, Felipe Contreras wrote:

> On Thu, Oct 15, 2009 at 3:18 AM, René Stadler <[hidden email]> wrote:
> > pl bossart wrote:
> >> That's not necessarily an option. There are 3rd party decoders out
> >> there whose code is not necessarily public. And fixing the decoders is
> >> somewhat odd when the real problem is the sink...
> >
> > The sink is not perfect, but the decoder situation also need work. Current
> > decoders chose the output buffer sizes themselves, and this is wrong. Yes
> > you could change the sink and stitch these buffers together using pad_alloc,
> > but the fact remains that the decoder picks the size and therefore decides
> > on the overhead up to the sink (and all processing elements between decoder
> > and sink).
> >
> > This became apparent to me when Felipe profiled OggVorbis playback with a
> > highly optimized decoder (ffmpeg). Basically the CPU spends an insane amount
> > of time pushing GStreamer buffers around compared the actual audio decoding.
> > And this on the N900, which shows exactly that the current situation is
> > complete nonsense for a battery-powered device.
>
> Indeed. I profiled the audio pipeline and 30% of the time was spent on
> the decoder, the rest was spent pushing buffers around. When I
> increased the buffer sizes pushed by the decoder (128k) efficiency
> increases, now the time spent is 45%, but still, 55% CPU time spent
> pushing buffers around is unacceptable.

It's now also possible to push multiple buffers at the same time by
using the buffer lists. I don't know if that solves anything here, I
guess it depends on the amount of encoded frames that a decoder
receives.

We could for example change the ogg demuxer to push all packets in a
page in a bufferlist and make the vorbisdecoder decode the complete list
before pushing the list of samples to the sink. This would reduce the
amount of objects that get pushed around between elements.

Also, pushing buffers should not be as expensive as 55%. I don't know
what's happening there. Maybe the gobject allocation is what's slowing
it down (patches exist for glib)? Maybe the locking/refcounting is slow
(should be simple atomic operations on N900)? Maybe the typechecking
(patches also exist for glib)? It would be nice to know what's causing
this overhead in more detail.

With newer pulse we can't really use buffer_alloc to write directly into
the pulse shared memory from the decoder because there is no API to
allocate such a chunk. There is pa_stream_begin_write(), but that can
only be called once AFAIK.

Wim

>
> Profiling what happens on pulseaudio side is an exercise I haven't
> done yet, but as René said, my guess is that there's some ideal buffer
> size that pulseaudio would like to receive from the decoder that's big
> enough for GStreamer not to choke on it.
>
> Removing the queue from the decoder to the sink would also help to
> avoid the unnecessary overhead of mutex contention, specially on small
> buffers.
>
> Ideally I guess the sink should be able to receive small buffers
> without performance penalty, but currently that doesn't seem to be
> case.
>
> Cheers.
>




Re: pulsesink optimizations

pl bossart
In reply to this post by René Stadler
> Note that minreq is just the threshold when pulse will ask for more data. You
> are free to send whatever amount is writable when you have data ready, it can
> be smaller or larger than minreq (pulsesink does exactly that).
<snip>
> Yes indeed, in fact the patch gives next to no CPU load improvement. However,
> it leads to the writes from gst to pa being grouped together with larger
> intervals of inactivity in between (tunable with the latency-time property).
> This grouping together results in improved power management. In the N900 I
> measured a penalty of 10% in energy consumption without the patch applied (for
> MP3 on wired headset, display off, i.e. typical long term playback use-case).

My point was that the buffer_time property is used to set the audio
latency, while the latency_time property doesn't set any latency, only
the granularity of the gstreamer processing. This is not exactly
self-explanatory without knowing in detail how PulseAudio works. To be
more consistent, we should rename these properties. In PulseAudio's
pacat, the options are called --latency and --process-time, which is a
lot more intuitive than the current gstreamer options.
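
A minimal sketch of that mapping, in Python, assuming (from this thread only) that the buffer time ends up as tlength and the latency time as minreq. The helper names, the 44.1 kHz stereo 16-bit format and the 200 ms / 10 ms values are illustrative assumptions, not taken from the pulsesink source.

```python
def usec_to_bytes(usec: int, rate: int = 44100, channels: int = 2,
                  bytes_per_sample: int = 2) -> int:
    """Convert a duration in microseconds to a whole number of audio frames, in bytes."""
    frame_size = channels * bytes_per_sample
    frames = usec * rate // 1_000_000
    return frames * frame_size


def buffer_attr_from_properties(buffer_time_usec: int, latency_time_usec: int) -> dict:
    """Map the two sink properties onto the buffer attributes discussed here."""
    return {
        "tlength": usec_to_bytes(buffer_time_usec),   # overall latency
        "minreq": usec_to_bytes(latency_time_usec),   # request granularity
    }


attr = buffer_attr_from_properties(buffer_time_usec=200_000, latency_time_usec=10_000)
print(attr)
```

Read this way, only the first value bounds how late the audio is heard; the second one only decides how often the client is asked for more data.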

While I don't have qualitative data, I concur with Felipe's
observations. I have a gstreamer-based audio player running at 8-9%
CPU while a stand-alone player using the same decoding engine needs 5%
in the same conditions (no UI, etc). That's a lot of overhead...


Re: pulsesink optimizations

Felipe Contreras
In reply to this post by Wim Taymans
On Thu, Oct 15, 2009 at 1:58 PM, Wim Taymans <[hidden email]> wrote:

> On Thu, 2009-10-15 at 13:01 +0300, Felipe Contreras wrote:
>> On Thu, Oct 15, 2009 at 3:18 AM, René Stadler <[hidden email]> wrote:
>> > pl bossart wrote:
>> >> That's not necessarily an option. There are 3rd party decoders out
>> >> there whose code is not necessarily public. And fixing the decoders is
>> >> somewhat odd when the real problem is the sink...
>> >
>> > The sink is not perfect, but the decoder situation also need work. Current
>> > decoders chose the output buffer sizes themselves, and this is wrong. Yes
>> > you could change the sink and stitch these buffers together using pad_alloc,
>> > but the fact remains that the decoder picks the size and therefore decides
>> > on the overhead up to the sink (and all processing elements between decoder
>> > and sink).
>> >
>> > This became apparent to me when Felipe profiled OggVorbis playback with a
>> > highly optimized decoder (ffmpeg). Basically the CPU spends an insane amount
>> > of time pushing GStreamer buffers around compared the actual audio decoding.
>> > And this on the N900, which shows exactly that the current situation is
>> > complete nonsense for a battery-powered device.
>>
>> Indeed. I profiled the audio pipeline and 30% of the time was spent on
>> the decoder, the rest was spent pushing buffers around. When I
>> increased the buffer sizes pushed by the decoder (128k) efficiency
>> increases, now the time spent is 45%, but still, 55% CPU time spent
>> pushing buffers around is unacceptable.
>
> It's now also possible to push multiple buffers at the same time by
> using the buffer lists. I don't know if that solves anything here, I
> guess it depends on the amount of encoded frames that a decoder
> receives.
>
> We could for example change the ogg demuxer to push all packets in a
> page in a bufferlist and make the vorbisdecoder decode the complete list
> before pushing the list of samples to the sink. This would reduce the
> amount of objects that get pushed around between elements.

I don't think that would help. What we need is to push big buffers to
pulsesink, regardless of what we receive on the input. It's not a big
problem for the decoder to fill some temporary buffer.

> Also, pushing buffers should not be as expensive as 55%, I don't know
> what's happening there, maybe the gobject allocation is what's slowing
> it down (patches exist for glib) ? maybe the locking/refcounting is slow
> (should be simple atomic operations on N900..)? maybe the typechecking
> (also patches exist for glib)? It would be nice to know what's causing
> this overhead in more detail.

It seems to me that allocating and unrefing buffers is taking much
more than expected:
http://people.freedesktop.org/~felipec/profile/mp3-1.png

> With newer pulse we can't really use buffer_alloc to write directly into
> the pulse shared memory from the decoder because there is no api to
> allocate such a chunk, there is pa_stream_begin_write() but that can
> only be called once AFAIK.

I think memcpy is the least grave of the problems right now.

--
Felipe Contreras


Re: pulsesink optimizations

pl bossart
>> We could for example change the ogg demuxer to push all packets in a
>> page in a bufferlist and make the vorbisdecoder decode the complete list
>> before pushing the list of samples to the sink. This would reduce the
>> amount of objects that get pushed around between elements.
>
> I don't think that would help. What we need is to push big buffers to
> pulsesink, regardless of what we receive on the input. It's not a big
> problem for the decoder to fill some temporary buffer.

It does not help cpu- or power-wise to push buffers larger than 64k
into PulseAudio, so the 'big' buffers would be limited to ~370ms or
~14 decoded MP3 frames.
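
The two figures above can be checked with a few lines of Python. The sample format (44.1 kHz, stereo, 16-bit) is an assumption; it is consistent with the 4608-byte decoded MP3 frame quoted earlier in the thread (1152 samples per channel).

```python
RATE = 44100              # assumption: CD-quality sample rate
CHANNELS = 2              # assumption: stereo
BYTES_PER_SAMPLE = 2      # assumption: 16-bit samples
MP3_FRAME_SAMPLES = 1152  # samples per channel in one MPEG-1 Layer III frame

limit_bytes = 64 * 1024
bytes_per_second = RATE * CHANNELS * BYTES_PER_SAMPLE
mp3_frame_bytes = MP3_FRAME_SAMPLES * CHANNELS * BYTES_PER_SAMPLE

duration_ms = 1000 * limit_bytes / bytes_per_second  # playback time held in 64k
frames = limit_bytes / mp3_frame_bytes               # decoded MP3 frames in 64k
print(round(duration_ms, 1), round(frames, 1))
```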
Given this upper bound, how would the decoder know how big the buffers
should really be? If somehow you don't provide the latency information
from the sink back to the decoder, the decoder is going to make
arbitrary decisions no matter what context it is used in. If you are
doing audio only, using large buffers is no issue, but if you are
using the same decoder with video active, you may want to avoid too
large buffers.


Re: pulsesink optimizations

Felipe Contreras
On Fri, Oct 16, 2009 at 12:21 AM, pl bossart <[hidden email]> wrote:

>>> We could for example change the ogg demuxer to push all packets in a
>>> page in a bufferlist and make the vorbisdecoder decode the complete list
>>> before pushing the list of samples to the sink. This would reduce the
>>> amount of objects that get pushed around between elements.
>>
>> I don't think that would help. What we need is to push big buffers to
>> pulsesink, regardless of what we receive on the input. It's not a big
>> problem for the decoder to fill some temporary buffer.
>
> It does not help cpu- or power-wise to push buffers larger than 64k
> into PulseAudio, so the 'big' buffers would be limited to ~370ms or
> ~14 decoded MP3 frames.

Currently playbin2 adds a queue between the decoder and the sink, so
they run in different threads, and the thread synchronization will
result in more overhead with smaller buffers (not to mention buffer
creation/destruction). So from GStreamer's point of view, there's a
direct relationship between buffer size and CPU overhead. I'm not sure
whether "too big" buffers would actually have a negative impact on PA,
but my guess is that there's a limit to how big the buffer should be,
and that limit would actually be the ideal size.

> Given this upper bound, how would the decoder know how big the buffers
> should really be? If somehow you don't provide the latency information
> from the sink back to the decoder, the decoder is going to make
> arbitrary decisions no matter what context it is used in. If you are
> doing audio only, using large buffers is no issue, but if you are
> using the same decoder with video active, you may want to avoid too
> large buffers.

That is true. I haven't thought about the video case; from my point of
view audio-only playback is already too screwed up to think about
that. What's the worst that could happen? A temporary A/V
synchronization mismatch when the video decoder lags behind?

--
Felipe Contreras


Re: pulsesink optimizations

Lennart Poettering
In reply to this post by pl bossart
On Wed, 14.10.09 14:44, pl bossart ([hidden email]) wrote:

> Hi folks,
> I noticed performance issues due to the rewrite of pulsesink since the
> 0.10.15 release. The degradation is in the 30% range on my Atom board
> when playing MP3/AAC. There have been a couple of modifications in git
> related to buffer attributes and latency settings, but overall the
> overhead remains, and the pulsesink code could be further optimized
> for low-power playback apps that don't care about latency.
>
> I finally took the time to look at the code and check what was going
> on. It seems that the overhead is mainly due to the granularity of
> transfers between pulsesink and PulseAudio. What happens is that the
> sink waits for space available in the PulseAudio buffer. When PA
> requests data in a callback, the mainloop unblocks and the sink writes
> its PCM to PulseAudio. The problem is that the sink will not try to
> fill the whole buffer before handing-off the data to PulseAudio. For
> example, say PulseAudio requests 100k (as defined by minreq) and you
> are doing MP3 decode, you are going to send one frame (4608 bytes) at
> a time to PulseAudio until the 100k have been filled. That's a lot of
> overhead. It would be a lot more efficient power-wise to decode and
> store as many frames as possible into the PA buffer before calling
> pa_stream_write().

This is mostly correct. But actually finding the right buffer sizes to
send to PA is a science of its own.

If you have to fill a 2s buffer and you calculate audio for that all
in one step and send it in one packet to PA, then you might have to do
some CPU-intensive work for quite some time (e.g. decoding AC3), during
which PA might run out of data to play, which might become a
problem. So the general rule is to send packets as big as possible
but not to block for too long while producing them. This is of course
a very imprecise definition.

Also, for optimizing the data transfer via SHM you shouldn't use memory
blocks larger than 64k right now (actually a little less), which is
the SHM tile size. I probably should export that value in libpulse in
some way, so that the clients can optimize for it, and pass blocks of
size MIN(pa_stream_get_writable_size(), pa_context_get_tile_size()) or
so.

I'll add that in the next release. And I think that block size would
be a good value to optimize the writes for. Unless one starts counting
CPU cycles finding the perfect block size is not possible anyway.
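
The MIN() rule above can be sketched as follows. This is a Python model only: `writable_bytes` and `tile_bytes` are plain integers standing in for the two values named in the message, no libpulse call is made, and rounding down to whole audio frames is an added assumption.

```python
def next_write_size(writable_bytes: int, tile_bytes: int, frame_size: int = 4) -> int:
    """Pick how many bytes to hand over in one write.

    Takes the smaller of what the server can accept and the SHM tile size,
    rounded down to a whole number of audio frames.
    """
    size = min(writable_bytes, tile_bytes)
    return max(0, size - size % frame_size)


tile = 64 * 1024  # assumption: the real tile size is "a little less" than this

print(next_write_size(100 * 1024, tile))  # server wants more than one tile
print(next_write_size(4610, tile))        # server wants less than one tile
```

Passing the tile size in as a parameter keeps the sketch independent of the exact value, which the message says is slightly below 64k.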

> I have snippets of code as a proof of concept. I don't mind releasing
> the code, but I must admit this is a hack and does not cover all the
> cases pulsesink addresses. An additional optimization could consist in
> passing the PulseAudio buffer upstream to avoid memory copies. The new
> PA release provides support for this with pa_stream_begin_write(). In
> short, I would badly need a review from more experienced developers...
> If anyone is interested let me know.

In fact I added the _begin_write() stuff specifically for use in
GStreamer, after a talk the Gst folks and I had at last FOSDEM.

Lennart

--
Lennart Poettering                        Red Hat, Inc.
lennart [at] poettering [dot] net
http://0pointer.net/lennart/           GnuPG 0x1A015CC4


Re: pulsesink optimizations

Lennart Poettering-8
In reply to this post by René Stadler
On Thu, 15.10.09 00:48, René Stadler ([hidden email]) wrote:

> >I finally took the time to look at the code and check what was going
> >on. It seems that the overhead is mainly due to the granularity of
> >transfers between pulsesink and PulseAudio. What happens is that the
> >sink waits for space available in the PulseAudio buffer. When PA
> >requests data in a callback, the mainloop unblocks and the sink writes
> >its PCM to PulseAudio. The problem is that the sink will not try to
> >fill the whole buffer before handing-off the data to PulseAudio. For
> >example, say PulseAudio requests 100k (as defined by minreq) and you
> >are doing MP3 decode, you are going to send one frame (4608 bytes) at
> >a time to PulseAudio until the 100k have been filled. That's a lot of
> >overhead. It would be a lot more efficient power-wise to decode and
> >store as many frames as possible into the PA buffer before calling
> >pa_stream_write().
>
> Wim just committed my patch that changes pulsesink back to set the
> minreq to the value of the latency-time property, which lets
> applications tune the gst<->pa overhead again.

In this context: a few days ago I wrote up this wiki page, which tries
to explain how to configure latency properly for PA streams:

http://pulseaudio.org/wiki/LatencyControl

> During the investigation of that regression, I found that there is
> some further things to optimize in pulsesink. I will be filing more
> bugs and sending more patches as I come up with better solutions.

BTW, any chance I could be subscribed automatically to all bugs
regarding pulsesink? Does anyone know if gnome bz can do that?

Lennart

--
Lennart Poettering                        Red Hat, Inc.
lennart [at] poettering [dot] net
http://0pointer.net/lennart/           GnuPG 0x1A015CC4


Re: pulsesink optimizations

Lennart Poettering-8
In reply to this post by pl bossart
On Wed, 14.10.09 18:13, pl bossart ([hidden email]) wrote:

>
> Howdy Rene',
>
> > Wim just committed my patch that changes pulsesink back to set the minreq to
> > the value of the latency-time property, which lets applications tune the
> > gst<->pa overhead again.
>
> Humm, my experiments show that the core activity increases when minreq
> is > 64k. I sort of remember Lennart mentioning that this was the size
> of the block allocated in PA, and beyond this you would use
> malloc().

Yes, that is true (as mentioned in the response I sent 5 min ago; I
probably should have read the full thread before responding...).

Story goes like this:

If you call pa_stream_write() with memory blocks < 64k, then PA will
be able to place your entire data in a single SHM tile, and is then
able to send the whole thing in one step to the other side. If you
pick larger sizes, then PA cannot optimize things that way and might
end up sending the data via the socket instead of shm.
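
To put rough numbers on the per-frame overhead discussed in this
thread, here is a small sketch that counts write calls with and without
packing frames into blocks of at most one tile. The helper names are
made up for illustration, and 64 KiB stands in for the real tile size,
which is a little smaller.

```c
#include <assert.h>
#include <stddef.h>

/* Number of write calls needed to deliver `request` bytes when every
 * decoded frame (frame_bytes each) is written on its own. */
size_t writes_per_frame(size_t request, size_t frame_bytes)
{
    assert(frame_bytes > 0);
    return (request + frame_bytes - 1) / frame_bytes;   /* ceiling division */
}

/* Same request, but whole frames are first packed into blocks of at
 * most max_block bytes, and each block is written once. */
size_t writes_coalesced(size_t request, size_t frame_bytes, size_t max_block)
{
    assert(frame_bytes > 0 && max_block >= frame_bytes);
    size_t frames = writes_per_frame(request, frame_bytes);
    size_t frames_per_block = max_block / frame_bytes;
    return (frames + frames_per_block - 1) / frames_per_block;
}
```

For a 100 KiB request and 4608-byte MP3 frames this gives 23 writes
when sending frame by frame, versus 2 when packing into 64 KiB blocks.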

Lennart

--
Lennart Poettering                        Red Hat, Inc.
lennart [at] poettering [dot] net
http://0pointer.net/lennart/           GnuPG 0x1A015CC4


Re: pulsesink optimizations

Lennart Poettering-8
In reply to this post by René Stadler
On Thu, 15.10.09 03:18, René Stadler ([hidden email]) wrote:

> >Besides, it seems to me that the total latency is really defined by
> >tlength, if you increase minreq the size of the server buffer will be
> >adjusted. See Lennart's page at
> >http://pulseaudio.org/wiki/LatencyControl, latency is defined with
> >tlength, minreq has no direct impact on latency.
> >And as I mentioned it, the patch doesn't change the overhead since we
> >keep writing the same size no matter what minreq was set to.
>
> Yes indeed, in fact the patch gives next to no CPU load improvement.
> However, it leads to the writes from gst to pa being grouped
> together with larger intervals of inactivity in between (tunable
> with the latency-time property). This grouping together results in
> improved power management. In the N900 I measured a penalty of 10%
> in energy consumption without the patch applied (for MP3 on wired
> headset, display off, i.e. typical long term playback use-case).

Hm, I wonder if I should formalize that in the PA API. i.e. provide
something that would allow the app to officially declare when one of
those packet "bursts" starts and when it ends? Something like this:

      pa_stream_begin_write_burst();
      pa_stream_write(...);
      pa_stream_write(...);
      pa_stream_write(...);
      pa_stream_write(...);
      pa_stream_end_write_burst();

And then add a couple of optimizations internally that would already
flush the buffers before the burst is over, according to some wallclock
timeout or a full SHM tile or so.

Or maybe that is too complex. Dunno.
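
One way the burst idea above could behave internally, as a toy model
only: writes inside a burst are queued and flushed when the next write
would overflow a tile, when a timeout has elapsed, or when the burst
ends. Every name below is hypothetical; no such API exists in libpulse,
and the clock is passed in by the caller to keep the sketch testable.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical burst tracker; it only counts flushes, it sends nothing. */
typedef struct {
    size_t pending;               /* bytes queued since the last flush */
    size_t tile_size;             /* maximum bytes per flush */
    unsigned long timeout_ms;     /* flush if this much time has passed */
    unsigned long last_flush_ms;  /* time of the last flush (or burst start) */
    unsigned flushes;             /* number of non-empty flushes so far */
} burst;

static void burst_flush(burst *b, unsigned long now_ms)
{
    if (b->pending > 0) {
        b->flushes++;
        b->pending = 0;
    }
    b->last_flush_ms = now_ms;
}

void burst_begin(burst *b, size_t tile_size, unsigned long timeout_ms,
                 unsigned long now_ms)
{
    assert(tile_size > 0);
    b->pending = 0;
    b->tile_size = tile_size;
    b->timeout_ms = timeout_ms;
    b->last_flush_ms = now_ms;
    b->flushes = 0;
}

void burst_write(burst *b, size_t nbytes, unsigned long now_ms)
{
    /* Flush first if this write would not fit into the current tile. */
    if (b->pending + nbytes > b->tile_size)
        burst_flush(b, now_ms);
    b->pending += nbytes;
    /* Flush early if the burst has been open for too long. */
    if (now_ms - b->last_flush_ms >= b->timeout_ms)
        burst_flush(b, now_ms);
}

void burst_end(burst *b, unsigned long now_ms)
{
    burst_flush(b, now_ms);
}
```

Feeding 23 frames of 4608 bytes with a 64 KiB tile and no time passing
results in two flushes: one when the 15th frame no longer fits, and one
at burst_end().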

Lennart

--
Lennart Poettering                        Red Hat, Inc.
lennart [at] poettering [dot] net
http://0pointer.net/lennart/           GnuPG 0x1A015CC4


Re: pulsesink optimizations

Lennart Poettering-8
In reply to this post by Wim Taymans
On Thu, 15.10.09 12:58, Wim Taymans ([hidden email]) wrote:

> With newer pulse we can't really use buffer_alloc to write directly into
> the pulse shared memory from the decoder because there is no api to
> allocate such a chunk, there is pa_stream_begin_write() but that can
> only be called once AFAIK.

Hmm, so you'd like to see some API in PA that allows you to allocate,
ref, unref multiple memory blocks at the same time and independently
of each other?

Internally that's what happens anyway, so we could definitely add
that. The main reason why I didn't expose this directly is that this
doesn't mix well with allowing the user to pass subsets of the
allocated buffers to pa_stream_write(), unless we add a completely new
function pa_stream_write_preallocated() that takes the buffer pointer,
an index into it and a length. Which would be kinda clumsy, wouldn't
it?

Also I am a bit afraid of how the semantics of PA's and Gst's buffer
handling might collide there. In PA memory blocks can actually change
location in memory at any time. To lock them into an accessible place
for a time you have to call an _acquire() function first and
afterwards a _release() function, and in between your code should not
do communication with other threads or other 'slow' stuff. This would
not mix well with Gst's own buffer handling, would it?

The current PA API makes clear in a way that the buffer should be
allocated using pa_stream_begin_write() only very shortly before the
actual _write() (or at least it was my intention to make that
clear). But if I export the whole memory block allocation API then
this would be really hard to express in the API so that people whose
memory allocation scheme is incompatible with this would not simply
call _acquire() right away and then delay the _release() until the
very last moment.

The scheme that is now part of the PA API is also designed to be
somewhat similar to how ALSA's mmap() API for playback works: when you
want to write your data you ask for a pointer, and then push your data
into it, and then commit this. I kinda assumed that Gst would gain
support for alsa mmap eventually and it would make sense to follow the
same scheme. What's the plan with that?
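
The "ask for a pointer, fill it, commit" scheme can be illustrated with
a self-contained toy buffer. This is neither the PA nor the ALSA API:
toy_stream and its functions are invented for this sketch, and only the
in/out size argument is modelled on how pa_stream_begin_write() reports
the size it actually granted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for a playback buffer owned by the other side. */
typedef struct {
    unsigned char data[1024];
    size_t committed;            /* bytes handed over so far */
} toy_stream;

/* Step 1: ask for a pointer. On input *nbytes is the wanted size,
 * on output it is the size actually granted. */
void toy_begin_write(toy_stream *s, void **ptr, size_t *nbytes)
{
    size_t space = sizeof s->data - s->committed;
    if (*nbytes > space)
        *nbytes = space;
    *ptr = s->data + s->committed;
}

/* Step 3: commit what was written into the granted region. */
void toy_commit(toy_stream *s, size_t nbytes)
{
    assert(nbytes <= sizeof s->data - s->committed);
    s->committed += nbytes;
}

/* A producer that writes straight into the granted region, so no
 * intermediate buffer and no extra copy are needed. */
size_t produce(toy_stream *s, unsigned char value, size_t want)
{
    void *p;
    size_t n = want;
    toy_begin_write(s, &p, &n);
    memset(p, value, n);         /* step 2: "decode" directly in place */
    toy_commit(s, n);
    return n;
}
```

The point of the pattern is that the pointer is held only between the
begin and the commit call, which keeps the window short in which the
memory has to stay pinned.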

Hmm, if all this zero-copy stuff is being rethought for gst, maybe
keeping alsa mmap io in mind might be a good idea. After all, of the
various alsa features, the mmap io stuff is actually one that didn't
trigger many problems when PA started to make heavy use of it. So I
can only encourage its use ;-)

Lennart

--
Lennart Poettering                        Red Hat, Inc.
lennart [at] poettering [dot] net
http://0pointer.net/lennart/           GnuPG 0x1A015CC4
