possible optimisations in pad_push mutex handling


possible optimisations in pad_push mutex handling

Marco Ballesio
Hi,

continuing a discussion begun by Felipe this week, I did my homework: I put together a test patch that disables a few mutexes in gst_pad_push (attached) and profiled its effects on both ARM and x86 architectures.

The first positive surprise is that, even after applying the patch, the following command runs properly on both architectures:

time gst-launch --gst-disable-registry-update audiotestsrc num-buffers=60000 blocksize=128 ! "audio/x-raw-int, rate=8000, width=16"  ! audioconvert ! audioconvert ! audioconvert ! fakesink

The second effect is that, running the command several times to collect a statistical set of results with the optimisation both enabled and disabled, I found a mean improvement in execution time of about 2.4% on ARM and about 2% on x86 (Intel Atom). Here are a few outputs:

On x86, mutexes disabled:
real  0m9.385s user  0m8.773s sys 0m0.076s
real  0m9.206s user  0m8.529s sys 0m0.068s

mutexes enabled:
real  0m9.553s user  0m8.949s sys 0m0.028s
real  0m9.669s user  0m9.017s sys 0m0.092s

On ARM (omap3430), mutexes disabled:
9.57user 0.07system 0:09.68elapsed
9.63user 0.03system 0:09.71elapsed

mutexes enabled:
9.88user 0.07system 0:09.98elapsed
9.80user 0.07system 0:09.90elapsed


Considering that a typical VoIP pipeline has more than double the number of elements in my example, the penalty on a real-life stream-engine pipeline should scale proportionally.

Some months ago I was studying a way to detect thread boundaries from within a pipeline and then, in case the application "promises" never to interact with it, to choose an optimised path for gst_pad_push (I later dropped the work because of other tasks). Would it be interesting to resume such an approach?

P.S. It appears the penalty decreases when buffers have bigger sizes, as already shown by Felipe.

Regards


Attachment: 0001-pad-remove-locks-test.patch (3K)

Re: possible optimisations in pad_push mutex handling

Wim Taymans
On Sun, 2010-10-10 at 21:40 +0300, Marco Ballesio wrote:

...

> The first positive surprise is that, even after applying the patch,
> the following command runs properly on both the architectures:
>
> time gst-launch --gst-disable-registry-update audiotestsrc
> num-buffers=60000 blocksize=128 ! "audio/x-raw-int, rate=8000,
> width=16"  ! audioconvert ! audioconvert ! audioconvert ! fakesink

I'm sure it will go faster, but it will cause weird refcounting problems
and crashes when you start linking and unlinking pads during data
processing, which is not acceptable.

<snip>

> Some months ago I was studying a way to detect thread boundaries from
> within a pipeline and then, in case the application "promises" never
> to interact with it, to choose an optimised path for gst_pad_push
> (then I dropped the work because of other tasks). Would it be
> interesting to resume such an approach?

Yes, it would be very interesting. I've been looking at how to do this
for a while now. Unfortunately, most ideas involve making the object
flags and the signal counters atomic (which is something I don't think
we can do safely for 0.10).

Maybe the easiest idea is to make an atomic 1-entry cache in
gst_pad_push() that contains the (reffed) peer pad. With the other flag
checks and buffer signals atomic, you can avoid taking and releasing two
locks and 2 atomic refcounts at the expense of 6 (hopefully simpler)
atomic operations.
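
Something along those lines, as a very rough sketch (the PadPushCache struct and its cached_peer field are hypothetical, not existing 0.10 API; only the peer lookup/ref part is shown, the actual data flow still goes through gst_pad_chain()):

#include <gst/gst.h>

/* Sketch only: a per-srcpad, lock-free one-entry cache of the (reffed)
 * peer pad. */
typedef struct {
  GstPad *srcpad;
  GstPad *cached_peer;          /* holds one ref while it sits in the cache */
} PadPushCache;

static GstFlowReturn
cached_pad_push (PadPushCache * cache, GstBuffer * buffer)
{
  GstPad *peer;
  GstFlowReturn ret;

  /* Fast path: no object lock, no ref/unref pair per buffer. */
  peer = g_atomic_pointer_get (&cache->cached_peer);
  if (G_LIKELY (peer != NULL))
    return gst_pad_chain (peer, buffer);        /* cache ref keeps peer alive */

  /* Slow path: resolve and ref the peer the usual (locked) way. */
  if ((peer = gst_pad_get_peer (cache->srcpad)) == NULL)
    return GST_FLOW_NOT_LINKED;

  ret = gst_pad_chain (peer, buffer);

  /* Publish the reffed peer for the next buffer; drop our ref if we lost
   * the race. Unlinking would have to atomically clear the cache and
   * unref, which is the tricky part. */
  if (!g_atomic_pointer_compare_and_exchange (&cache->cached_peer, NULL, peer))
    gst_object_unref (peer);

  return ret;
}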

BTW, before you go down this route, I'm sure you can find a 2%
performance increase elsewhere, like making the recursive locks in glib
use native pthread recursive locks instead of the 7 method calls that it
does now. Or maybe add less contention to the glib type lookups.. Or
maybe improve some of the element processing functions.
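
For reference, a natively recursive lock with plain pthreads needs nothing more than this (standard POSIX calls, shown only as an illustration of the idea):

#include <pthread.h>

/* A natively recursive mutex: the implementation tracks owner and depth,
 * so no extra user-space bookkeeping is needed per lock/unlock. */
static pthread_mutex_t rec_lock;

static void
rec_lock_init (void)
{
  pthread_mutexattr_t attr;

  pthread_mutexattr_init (&attr);
  pthread_mutexattr_settype (&attr, PTHREAD_MUTEX_RECURSIVE);
  pthread_mutex_init (&rec_lock, &attr);
  pthread_mutexattr_destroy (&attr);
}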

>
> P.S. it appears the penalty decreases when buffers have bigger sizes,
> as already shown from Felipe.

That's not what that test showed; it showed that the more buffers you
push per second, the more CPU you consume, which is rather obvious.

>
> Regards

Re: possible optimisations in pad_push mutex handling

Stefan Sauer
On 11.10.2010 11:19, Wim Taymans wrote:

> On Sun, 2010-10-10 at 21:40 +0300, Marco Ballesio wrote:
>
> ...
>
>  
>> The first positive surprise is that, even after applying the patch,
>> the following command runs properly on both the architectures:
>>
>> time gst-launch --gst-disable-registry-update audiotestsrc
>> num-buffers=60000 blocksize=128 ! "audio/x-raw-int, rate=8000,
>> width=16"  ! audioconvert ! audioconvert ! audioconvert ! fakesink
>>    
> I'm sure it will go faster but it will cause weird refcounting problems
> and crashes when you start linking and unlinking pads during
> dataprocessing, which is not acceptable.
>
> <snip>
>
>  
We should find a way to move the extra complexity to unlinking. Linking
is less of an issue, as an unblocked streaming pad just gets linked.
Some of the complexity in pad_push/get_range is there to properly error
out if one unlinks without ensuring that dataflow is blocked or stopped.
The current implementation is safe, but a bit expensive.
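
Roughly, the per-buffer pattern we pay for looks like this (heavily simplified sketch, not the exact 0.10 gst_pad_push() code):

static GstFlowReturn
push_pattern (GstPad * pad, GstBuffer * buffer)
{
  GstPad *peer;
  GstFlowReturn ret;

  GST_OBJECT_LOCK (pad);                /* lock taken for every buffer */
  peer = GST_PAD_PEER (pad);
  if (peer == NULL) {
    GST_OBJECT_UNLOCK (pad);
    return GST_FLOW_NOT_LINKED;         /* a concurrent unlink errors out cleanly */
  }
  gst_object_ref (peer);                /* keep the peer alive across the call */
  GST_OBJECT_UNLOCK (pad);

  ret = gst_pad_chain (peer, buffer);   /* the peer side takes its own locks too */

  gst_object_unref (peer);
  return ret;
}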

>> Some months ago I was studying a way to detect thread boundaries from
>> within a pipeline and then, in case the application "promises" never
>> to interact with it, to choose an optimised path for gst_pad_push
>> (then I dropped the work because of other tasks). Would it be
>> interesting to resume such an approach?
>>    
> Yes, it would be very interesting. I've been looking at how to do this
> for a while now. Unfortunately, most ideas involve making the object
> flags and the signal counters atomic (which is something I don't think
> we can do safely for 0.10).
>
> Maybe the easiest idea is to make an atomic 1 entry cache in
> gst_pad_push() that contains the (reffed) peer pad. With the other flag
> checks and buffer signals atomic, you can avoid taking and releasing two
> locks and 2 atomic refcounts at the expense of 6 (hopefully more simple)
> atomic operations.
>
> BTW, before you go down this route, I'm sure you can find a 2%
> performance increase elsewhere, like making the recursive locks in glib
> use native pthread recursive locks instead of the 7 method calls that it
> does now. Or maybe add less contention to the glib type lookups.. Or
> maybe improve some of the element processing functions.
>
>  
Yes, there are many places, but an improvement in pad_push nicely
helps all over the place :)

Stefan

>> P.S. it appears the penalty decreases when buffers have bigger sizes,
>> as already shown from Felipe.
>>    
> That's not what that test showed, it showed that the more buffers you
> push per second, the more CPU you consume, which is rather obvious.
>
>  
>> Regards

Re: possible optimisations in pad_push mutex handling

Edward Hervey
On Mon, 2010-10-11 at 10:19 +0200, Wim Taymans wrote:
>
> BTW, before you go down this route, I'm sure you can find a 2%
> performance increase elsewhere, like making the recursive locks in
> glib use native pthread recursive locks instead of the 7 method calls
> that it does now.

  Working on that one (native pthread recursive mutex in glib), should
have some useful data by tonight/tomorrow.

>  Or maybe add less contention to the glib type lookups..

  Revived my old native rwlock (based on wtay's original work) to help
potentially speed up all the rwlock usage in gobject/gtype.c

   Edward



Re: possible optimisations in pad_push mutex handling

Gruenke, Matt
-----Original Message-----
From: Wim Taymans [mailto:[hidden email]]
Sent: Monday, October 11, 2010 04:20
To: Discussion of the development of GStreamer
Subject: Re: [gst-devel] possible optimisations in pad_push mutex
handling

<snip/>

> Yes, it would be very interesting. I've been looking at how to do this
> for a while now. Unfortunately, most ideas involve making the object
> flags and the signal counters atomic (which is something I don't think
> we can do safely for 0.10).

I like the idea of optimizing the general case.  Would it be possible to
use some sort of hierarchical locking scheme - where a streaming thread
wouldn't grab lower-level locks if it could lock the entire pipeline
segment?

One aspect of locking to watch out for is normal mutexes that are held
most of the time.  I ran into a case, on Linux 2.6, where the thread
holding the lock was running at significantly higher priority.  The OS
did not preempt the first thread when it released the mutex, even
though other threads were blocked waiting to acquire it.

I think the solution might be to use a different type of mutex (e.g.
read/write).
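
As a rough illustration of the idea (plain pthread rwlock, hypothetical function names, not a concrete GStreamer patch): streaming threads would take the lock shared, and only the rare link/unlink path would take it exclusively.

#include <pthread.h>

static pthread_rwlock_t segment_lock = PTHREAD_RWLOCK_INITIALIZER;

static void
push_one_buffer (void)
{
  pthread_rwlock_rdlock (&segment_lock);   /* shared: pushes don't serialize */
  /* ... push the buffer downstream ... */
  pthread_rwlock_unlock (&segment_lock);
}

static void
relink_pads (void)
{
  pthread_rwlock_wrlock (&segment_lock);   /* exclusive: waits for in-flight pushes */
  /* ... unlink/relink pads safely ... */
  pthread_rwlock_unlock (&segment_lock);
}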

<snip/>

> Or maybe add less contention to the glib type lookups.. Or
> maybe improve some of the element processing functions.

That would sure be nice!

<snip/>


Matt



Re: possible optimisations in pad_push mutex handling

Marco Ballesio
Hi,

sorry for the late reply, but these days I only get a few hacking hours a day, and not during working hours, much to the happiness of my family ;)

On Mon, Oct 11, 2010 at 11:19 AM, Wim Taymans <[hidden email]> wrote:
On Sun, 2010-10-10 at 21:40 +0300, Marco Ballesio wrote:

...

> The first positive surprise is that, even after applying the patch,
> the following command runs properly on both the architectures:
>
> time gst-launch --gst-disable-registry-update audiotestsrc
> num-buffers=60000 blocksize=128 ! "audio/x-raw-int, rate=8000,
> width=16"  ! audioconvert ! audioconvert ! audioconvert ! fakesink

I'm sure it will go faster but it will cause weird refcounting problems
and crashes when you start linking and unlinking pads during
dataprocessing, which is not acceptable.

My patch is definitely not meant as a fix; it's just something to show how things could be optimised in some cases.
 

<snip>

> Some months ago I was studying a way to detect thread boundaries from
> within a pipeline and then, in case the application "promises" never
> to interact with it, to choose an optimised path for gst_pad_push
> (then I dropped the work because of other tasks). Would it be
> interesting to resume such an approach?

Yes, it would be very interesting. I've been looking at how to do this
for a while now. Unfortunately, most ideas involve making the object
flags and the signal counters atomic (which is something I don't think
we can do safely for 0.10).

Maybe the easiest idea is to make an atomic 1 entry cache in
gst_pad_push() that contains the (reffed) peer pad. With the other flag
checks and buffer signals atomic, you can avoid taking and releasing two
locks and 2 atomic refcounts at the expense of 6 (hopefully more simple)
atomic operations.

mhh, atomic ops have their weight too, but indeed such an optimisation would be better than nothing ;)
 

BTW, before you go down this route, I'm sure you can find a 2%
performance increase elsewhere, like making the recursive locks in glib

please consider that the weight is 2% only in this case, where the pipeline is REALLY simple, with 5 elements and 8 pads. A real VoIP pipeline has many more elements, more than 8 in the rtpbin alone. The overhead should thus be proportional to the number of pads (on ARM, about a 20% CPU load increase, mostly in the kernel, has been measured when going from 60 ms to 20 ms buffer times).
 
use native pthread recursive locks instead of the 7 method calls that it
does now. Or maybe add less contention to the glib type lookups.. Or
maybe improve some of the element processing functions.

>
> P.S. it appears the penalty decreases when buffers have bigger sizes,
> as already shown from Felipe.

That's not what that test showed, it showed that the more buffers you
push per second, the more CPU you consume, which is rather obvious.

well, not for me ;). In a perfect world the cost should grow with the quantity of data, not with how the data is partitioned; in those terms the per-buffer overhead should be O(1), and my patch tends to prove this. That is, the more pads are traversed per second, the more mutexes are locked and unlocked. I'd like somebody to give Felipe's test a shot (if possible) with my patch, but I'm not sure about the overall stability...

Regards,
Marco
 

>
> Regards




Re: possible optimisations in pad_push mutex handling

Gruenke, Matt

From: Marco Ballesio [mailto:[hidden email]]
Sent: Monday, October 11, 2010 12:46
To: Discussion of the development of GStreamer
Subject: Re: [gst-devel] possible optimisations in pad_push mutex handling

 

 

<snip/>

 

> P.S. it appears the penalty decreases when buffers have bigger sizes,
> as already shown from Felipe.

That's not what that test showed, it showed that the more buffers you
push per second, the more CPU you consume, which is rather obvious.


> well, not for me ;). In a perfect world the cost should grow with the quantity of data,
> not with how the data is partitioned; in those terms the per-buffer overhead should be
> O(1), and my patch tends to prove this.

 

Wim is correct (it's not the actual size of the buffers, but their frequency that matters), but I think your point is that this doesn't mean there isn't still an issue to be addressed?  Ideally, GStreamer would have no overhead, but it doesn't (and nothing will).  I think the key point is that the per-buffer overhead is *significant* for your application, so you're interested in looking at ways to reduce it.

 

We also have scenarios where GStreamer overhead is significant (e.g. handling lots of compressed streams on a server platform), so there's broader interest in trying to reduce overhead.  However, since we also modify running pipelines, we're very interested in that functionality continuing to work.

 

<snip/>

Matt



Re: possible optimisations in pad_push mutex handling

Marco Ballesio
Hi,

..snip..

> well, not for me ;). In a perfect world the cost should grow with the quantity of data,
> not with how the data is partitioned; in those terms the per-buffer overhead should be
> O(1), and my patch tends to prove this.

Wim is correct (it's not the actual size of the buffers, but their frequency that matters), but I think your point is that this doesn't mean there isn't still an issue to be addressed?  Ideally, GStreamer would have no overhead, but it doesn't (and nothing will).


I agree with you: it will never be possible to have a GStreamer completely independent from the data partitioning. Generically speaking, it is possible to have algorithms with such a characteristic, but I can't figure out anything which would have a real applicability, for instance, in the VoIP case.
 

I think the key point is that the per-buffer overhead is *significant* for your application, so you're interested in looking at ways to reduce it.


Yep, my point is that we're far from optimal, and I'm pleased to see the most prominent community members come up with so many good ideas. My wish would be to get something with a performance level close to that of my patch (and I know it's possible to do even better), but without any of the penalties that make it completely useless for anything more than testing.

I'd also like to find more areas of improvement like this one, both to establish the "absolute minimum" we could reach without major architectural changes and to propose an optimised architecture for the time to come.
 

 

We also have scenarios where GStreamer overhead is significant (e.g. handling lots of compressed streams on a server platform), so there's broader interest in trying to reduce overhead.  However, since we also modify running pipelines, we're very interested in that functionality continuing to work.


well, after seeing so many good ideas I'm no longer a great defender of my original proposal, but it would work even in your case: the API includes both a way to use an optimised path and a way to go back to the old (unoptimised) one when the application needs to interact with the pipeline.

Maybe I'll find some time to make the proof-of-concept patches presentable and post them somewhere..

Regards

 

 

<snip/>

Matt
