Hi,
continuing a discussion begun from Felipe this week, I completed my homework to retrieve a testing patch which disables a few mutexes from gst_pad_push (attached) and profile its effects on both ARM and X86 architectures. The first positive surprise is that, even after applying the patch, the following command runs properly on both the architectures: time gst-launch --gst-disable-registry-update audiotestsrc num-buffers=60000 blocksize=128 ! "audio/x-raw-int, rate=8000, width=16" ! audioconvert ! audioconvert ! audioconvert ! fakesink the second effect is that, running the command a few times and getting a statistic set of results, with both optimisation enabled and disabled, I found that the execution time for the command has a mean improvement of about 2.4% on ARM and about 2% on x86 (Intel Atom processor). Here a few outputs: On x86, mutexes disabled: real 0m9.385s user 0m8.773s sys 0m0.076s real 0m9.206s user 0m8.529s sys 0m0.068s mutexes enabled: real 0m9.553s user 0m8.949s sys 0m0.028s real 0m9.669s user 0m9.017s sys 0m0.092s On ARM (omap3430), mutexes disabled: 9.57user 0.07system 0:09.68elapsed 9.63user 0.03system 0:09.71elapsed mutexes enabled: 9.88user 0.07system 0:09.98elapsed 9.80user 0.07system 0:09.90elapsed Considering a mean VoIP pipeline has a number of elements more than double than the ones in my example, the penalty on a real-life stream-engine pipeline should be proportional. Some months ago I was studying a way to detect thread boundaries from within a pipeline and then, in case the application "promises" never to interact with it, to choose an optimised path for gst_path_push (then I dropped the work because of other tasks). Would it be interesting to resume such an approach? P.S. it appears the penalty decreases when buffers have bigger sizes, as already shown from Felipe. Regards ------------------------------------------------------------------------------ Beautiful is writing same markup. Internet Explorer 9 supports standards for HTML5, CSS3, SVG 1.1, ECMAScript5, and DOM L2 & L3. Spend less time writing and rewriting code and more time creating great experiences on the web. Be a part of the beta today. http://p.sf.net/sfu/beautyoftheweb _______________________________________________ gstreamer-devel mailing list [hidden email] https://lists.sourceforge.net/lists/listinfo/gstreamer-devel 0001-pad-remove-locks-test.patch (3K) Download Attachment |
On Sun, 2010-10-10 at 21:40 +0300, Marco Ballesio wrote:
... > The first positive surprise is that, even after applying the patch, > the following command runs properly on both the architectures: > > time gst-launch --gst-disable-registry-update audiotestsrc > num-buffers=60000 blocksize=128 ! "audio/x-raw-int, rate=8000, > width=16" ! audioconvert ! audioconvert ! audioconvert ! fakesink I'm sure it will go faster but it will cause weird refcounting problems and crashes when you start linking and unlinking pads during dataprocessing, which is not acceptable. <snip> > Some months ago I was studying a way to detect thread boundaries from > within a pipeline and then, in case the application "promises" never > to interact with it, to choose an optimised path for gst_path_push > (then I dropped the work because of other tasks). Would it be > interesting to resume such an approach? Yes, it would be very interesting. I've been looking at how to do this for a while now. Unfortunately, most ideas involve making the object flags and the signal counters atomic (which is something I don't think we can do safely for 0.10). Maybe the easiest idea is to make an atomic 1 entry cache in gst_pad_push() that contains the (reffed) peer pad. With the other flag checks and buffer signals atomic, you can avoid taking and releasing two locks and 2 atomic refcounts at the expense of 6 (hopefully more simple) atomic operations. BTW, before you go down this route, I'm sure you can find a 2% performance increase elsewhere, like making the recursive locks in glib use native pthread recursive locks instead of the 7 method calls that it does now. Or maybe add less contention to the glib type lookups.. Or maybe improve some of the element processing functions. > > P.S. it appears the penalty decreases when buffers have bigger sizes, > as already shown from Felipe. That's not what that test showed, it showed that the more buffers you push per second, the more CPU you consume, which is rather obvious. > > Regards > ------------------------------------------------------------------------------ > Beautiful is writing same markup. Internet Explorer 9 supports > standards for HTML5, CSS3, SVG 1.1, ECMAScript5, and DOM L2 & L3. > Spend less time writing and rewriting code and more time creating great > experiences on the web. Be a part of the beta today. > http://p.sf.net/sfu/beautyoftheweb > _______________________________________________ gstreamer-devel mailing list [hidden email] https://lists.sourceforge.net/lists/listinfo/gstreamer-devel ------------------------------------------------------------------------------ Beautiful is writing same markup. Internet Explorer 9 supports standards for HTML5, CSS3, SVG 1.1, ECMAScript5, and DOM L2 & L3. Spend less time writing and rewriting code and more time creating great experiences on the web. Be a part of the beta today. http://p.sf.net/sfu/beautyoftheweb _______________________________________________ gstreamer-devel mailing list [hidden email] https://lists.sourceforge.net/lists/listinfo/gstreamer-devel |
On 11.10.2010 11:19, Wim Taymans wrote:
> On Sun, 2010-10-10 at 21:40 +0300, Marco Ballesio wrote: > > ... > > >> The first positive surprise is that, even after applying the patch, >> the following command runs properly on both the architectures: >> >> time gst-launch --gst-disable-registry-update audiotestsrc >> num-buffers=60000 blocksize=128 ! "audio/x-raw-int, rate=8000, >> width=16" ! audioconvert ! audioconvert ! audioconvert ! fakesink >> > I'm sure it will go faster but it will cause weird refcounting problems > and crashes when you start linking and unlinking pads during > dataprocessing, which is not acceptable. > > <snip> > > is less of an issues as a unblocked streaming pad is linked. Some of the complexity in pad_push/get_range is to properly error out if one unlinks without ensuring that dataflow is blocked or stopped. The current way how this is implemented is save, but a bit expensive. >> Some months ago I was studying a way to detect thread boundaries from >> within a pipeline and then, in case the application "promises" never >> to interact with it, to choose an optimised path for gst_path_push >> (then I dropped the work because of other tasks). Would it be >> interesting to resume such an approach? >> > Yes, it would be very interesting. I've been looking at how to do this > for a while now. Unfortunately, most ideas involve making the object > flags and the signal counters atomic (which is something I don't think > we can do safely for 0.10). > > Maybe the easiest idea is to make an atomic 1 entry cache in > gst_pad_push() that contains the (reffed) peer pad. With the other flag > checks and buffer signals atomic, you can avoid taking and releasing two > locks and 2 atomic refcounts at the expense of 6 (hopefully more simple) > atomic operations. > > BTW, before you go down this route, I'm sure you can find a 2% > performance increase elsewhere, like making the recursive locks in glib > use native pthread recursive locks instead of the 7 method calls that it > does now. Or maybe add less contention to the glib type lookups.. Or > maybe improve some of the element processing functions. > > helps all over the place :) Stefan >> P.S. it appears the penalty decreases when buffers have bigger sizes, >> as already shown from Felipe. >> > That's not what that test showed, it showed that the more buffers you > push per second, the more CPU you consume, which is rather obvious. > > >> Regards >> ------------------------------------------------------------------------------ >> Beautiful is writing same markup. Internet Explorer 9 supports >> standards for HTML5, CSS3, SVG 1.1, ECMAScript5, and DOM L2 & L3. >> Spend less time writing and rewriting code and more time creating great >> experiences on the web. Be a part of the beta today. >> http://p.sf.net/sfu/beautyoftheweb >> _______________________________________________ gstreamer-devel mailing list [hidden email] https://lists.sourceforge.net/lists/listinfo/gstreamer-devel >> > > > ------------------------------------------------------------------------------ > Beautiful is writing same markup. Internet Explorer 9 supports > standards for HTML5, CSS3, SVG 1.1, ECMAScript5, and DOM L2 & L3. > Spend less time writing and rewriting code and more time creating great > experiences on the web. Be a part of the beta today. > http://p.sf.net/sfu/beautyoftheweb > _______________________________________________ > gstreamer-devel mailing list > [hidden email] > https://lists.sourceforge.net/lists/listinfo/gstreamer-devel > ------------------------------------------------------------------------------ Beautiful is writing same markup. Internet Explorer 9 supports standards for HTML5, CSS3, SVG 1.1, ECMAScript5, and DOM L2 & L3. Spend less time writing and rewriting code and more time creating great experiences on the web. Be a part of the beta today. http://p.sf.net/sfu/beautyoftheweb _______________________________________________ gstreamer-devel mailing list [hidden email] https://lists.sourceforge.net/lists/listinfo/gstreamer-devel |
Administrator
|
In reply to this post by Wim Taymans
On Mon, 2010-10-11 at 10:19 +0200, Wim Taymans wrote:
> > BTW, before you go down this route, I'm sure you can find a 2% > performance increase elsewhere, like making the recursive locks in > glib use native pthread recursive locks instead of the 7 method calls > that it does now. Working on that one (native pthread recursive mutex in glib), should have some useful data by tonight/tomorrow. > Or maybe add less contention to the glib type lookups.. Revived my old native rwlock (based on wtay's original work) to help potentially speed up all the rwlock usage in gobject/gtype.c Edward ------------------------------------------------------------------------------ Beautiful is writing same markup. Internet Explorer 9 supports standards for HTML5, CSS3, SVG 1.1, ECMAScript5, and DOM L2 & L3. Spend less time writing and rewriting code and more time creating great experiences on the web. Be a part of the beta today. http://p.sf.net/sfu/beautyoftheweb _______________________________________________ gstreamer-devel mailing list [hidden email] https://lists.sourceforge.net/lists/listinfo/gstreamer-devel |
In reply to this post by Wim Taymans
-----Original Message-----
From: Wim Taymans [mailto:[hidden email]] Sent: Monday, October 11, 2010 04:20 To: Discussion of the development of GStreamer Subject: Re: [gst-devel] possible optimisations in pad_push mutex handling <snip/> > Yes, it would be very interesting. I've been looking at how to do this > for a while now. Unfortunately, most ideas involve making the object > flags and the signal counters atomic (which is something I don't think > We can do safely for 0.10). I like the idea of optimizing the general case. Would it be possible to use some sort of hierarchical locking scheme - where a streaming thread wouldn't grab lower-level locks if it could lock the entire pipeline segment? One aspect of locking to watch out for is normal mutexes that are held for most of the time. I ran into a case, on Linux 2.6, where the thread holding the lock was running at significantly higher priority. The OS did not preempt the first thread, when it released the mutex, even though other threads were blocked on obtaining it. I think the solution might be to use a different type of mutex (e.g. read/write). <snip/> > Or maybe add less contention to the glib type lookups.. Or > maybe improve some of the element processing functions. That would sure be nice! <snip/> Matt ------------------------------------------------------------------------------ Beautiful is writing same markup. Internet Explorer 9 supports standards for HTML5, CSS3, SVG 1.1, ECMAScript5, and DOM L2 & L3. Spend less time writing and rewriting code and more time creating great experiences on the web. Be a part of the beta today. http://p.sf.net/sfu/beautyoftheweb _______________________________________________ gstreamer-devel mailing list [hidden email] https://lists.sourceforge.net/lists/listinfo/gstreamer-devel |
In reply to this post by Wim Taymans
Hi,
sorry for the late reply, but these times I've actual few hacking hours a day, indeed not during working hours, with great happiness of my family ;) On Mon, Oct 11, 2010 at 11:19 AM, Wim Taymans <[hidden email]> wrote: On Sun, 2010-10-10 at 21:40 +0300, Marco Ballesio wrote: My patch is not definitely meant as a fix, it's just something to show how things could be optimised in some cases.
mhh, atomic ops have their weight too, but indeed such an optimisation would be better than nothing ;)
please consider the weight is 2% only in this case, where the pipeline is REALLY simple, with 5 elements and 8 pads. A real voip pipeline has much more elements, more than 8 only in the rtpbin. The overhead should be thus proportional to the number of pads (on the ARM about a 20% CPU load increase, mostly in the kernel, has been measured between 60ms and 20ms buffer times). use native pthread recursive locks instead of the 7 method calls that it well, not for me ;). In a perfect world the algorithm should increase in complexity wrt the quantity of data, and not depending on how it's partitioned. The algorithm should be O(1) in this terms, and my patch tends to proof this. That is, the most pads are traversed per second, the most mutexes are locked/unlocked. I'd like somebody to give a shot (if possible) to Felipe's test with my patch,but I'm not sure about the overall stability... Regards, Marco
------------------------------------------------------------------------------ Beautiful is writing same markup. Internet Explorer 9 supports standards for HTML5, CSS3, SVG 1.1, ECMAScript5, and DOM L2 & L3. Spend less time writing and rewriting code and more time creating great experiences on the web. Be a part of the beta today. http://p.sf.net/sfu/beautyoftheweb _______________________________________________ gstreamer-devel mailing list [hidden email] https://lists.sourceforge.net/lists/listinfo/gstreamer-devel |
From: Marco Ballesio
[mailto:[hidden email]] <snip/>
> the quantity of data,
and not depending on how it's partitioned.
The algorithm should > be O(1) in this terms,
and my patch tends to proof this. Wim is correct (it’s not the actual size
of the buffers, but their frequency that matters), but I think your point is
that it doesn’t mean there’s not still an issue to be addressed? Ideally,
GStreamer to have no overhead, but it doesn’t (and nothing will). I
think the key point is that the per-buffer overhead is *significant*, for your application, so you’re
interested in looking at ways to reduce it. We also have scenarios where GStreamer
overhead is significant (i.e. handling lots of compressed streams on a server
platform), so there’s broader interest in trying to reduce overhead. However,
since we also modify running pipelines, we’re very interesting in that functionality
continuing to work. <snip/> Matt ------------------------------------------------------------------------------ Beautiful is writing same markup. Internet Explorer 9 supports standards for HTML5, CSS3, SVG 1.1, ECMAScript5, and DOM L2 & L3. Spend less time writing and rewriting code and more time creating great experiences on the web. Be a part of the beta today. http://p.sf.net/sfu/beautyoftheweb _______________________________________________ gstreamer-devel mailing list [hidden email] https://lists.sourceforge.net/lists/listinfo/gstreamer-devel |
Hi,
..snip..
I agree with you: it will never be possible to have a GStreamer completely independent from the data partitioning. Generically speaking, it is possible to have algorithms with such a characteristic, but I can't figure out anything which would have a real applicability, for instance, in the VoIP case.
Yep, my point is that we're far from optimality, and I'm pleased in seeing the most prominent community members with so many good ideas. My wish would be to get something with a performance level close to the one of my patch (and I know it's possible to do even better), but without any of the penalties which make it completely useless for anything more than testing. I'd also like to see find more areas of improvement like this one to be able and get an "absolute minimum" we could get without major architectural changes, but also to propose an optimised architecture for the time to come..
well, after seeing so many good ideas I'm not a great defender of my original proposal, but it would work even in your case as the API includes both a way to use an optimised path and another to go back to the old (unoptimised) one in case the application needs to interact with the pipeline. Maybe I'll find some time to make the proof-of-concept patches presentable and post them somewhere.. Regards
------------------------------------------------------------------------------ Beautiful is writing same markup. Internet Explorer 9 supports standards for HTML5, CSS3, SVG 1.1, ECMAScript5, and DOM L2 & L3. Spend less time writing and rewriting code and more time creating great experiences on the web. Be a part of the beta today. http://p.sf.net/sfu/beautyoftheweb _______________________________________________ gstreamer-devel mailing list [hidden email] https://lists.sourceforge.net/lists/listinfo/gstreamer-devel |
Free forum by Nabble | Edit this page |