{"version": "https://jsonfeed.org/version/1", "title": "/dev/posts/ - Tag index - simgrid", "home_page_url": "https://www.gabriel.urdhr.fr", "feed_url": "/tags/simgrid/feed.json", "items": [{"id": "http://www.gabriel.urdhr.fr/2016/08/01/simgrid-synchronisation/", "title": "C++ synchronisations for SimGrid", "url": "https://www.gabriel.urdhr.fr/2016/08/01/simgrid-synchronisation/", "date_published": "2016-08-01T00:00:00+02:00", "date_modified": "2016-08-01T00:00:00+02:00", "tags": ["computer", "simgrid", "c++", "future"], "content_html": "
This is an overview of some recent additions to the SimGrid code related to actor synchronisation. It might be interesting for people using or working on SimGrid, and for anyone interested in generic C++ code for synchronisation or asynchronicity.
\nSimGrid is a discrete event simulator of\ndistributed systems: it does not simulate the world by small fixed-size steps\nbut determines the date of the next event (such as the end of a communication,\nthe end of a computation) and jumps to this date.
\nA number of actors executing user-provided code run on top of the\nsimulation kernel[1]. When an actor needs to interact with the simulation\nkernel (eg. to start a communication), it issues a simcall\n(simulation call, an analogy to system calls) to the simulation kernel.\nThis freezes the actor until it is woken up by the simulation kernel\n(eg. when the communication is finished).
The key ideas here are:

- An actor must not block at the OS level, for example with pthread_mutex_lock() or std::mutex: the simulation kernel would wait for the actor to issue a simcall and would deadlock. Instead, the actor must use simulation-level synchronisation primitives (such as simcall_mutex_lock()).
- Similarly, an actor must not call std::this_thread::sleep_for(), which waits in the real world; it must instead wait in the simulation with simcall_process_sleep().

We need a generic way to represent asynchronous operations in the simulation kernel. Futures are a nice abstraction for this which have been added to a lot of languages (Java, Python, C++ since C++11, ECMAScript, etc.)[2].
A future represents the result of an asynchronous operation. As the operation may not be completed yet, its result is not available yet. Two different sorts of APIs may be used to expose this future result:

- a blocking API, where the consumer waits until the result is available (res = f.get());
- a continuation-based API, where the consumer registers a function to be called with the result once it is available (future.then(something_to_do_with_the_result)).
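The continuation-based style can be illustrated with a minimal standalone sketch. IntFuture and continuation_demo below are hypothetical names for this illustration, not SimGrid code:

```cpp
#include <cassert>
#include <functional>
#include <optional>

// Toy continuation-based future (illustrative only): the consumer
// registers a callback which fires once the producer sets the value.
class IntFuture {
public:
    void then(std::function<void(int)> continuation)
    {
        if (value_)
            continuation(*value_);                    // already ready: run now
        else
            continuation_ = std::move(continuation);  // run later
    }
    void set_value(int v)
    {
        value_ = v;
        if (continuation_)
            continuation_(v);  // fire the pending callback
    }
private:
    std::optional<int> value_;
    std::function<void(int)> continuation_;
};

int continuation_demo()
{
    IntFuture f;
    int result = 0;
    f.then([&](int v) { result = v; });  // registered before the value exists
    f.set_value(42);                     // producer side: triggers the callback
    return result;
}
```

Unlike this toy, the SimGrid kernel futures described below schedule the continuation instead of calling it from inside set_value().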
).C++11 includes a generic class (std::future<T>
) which implements a blocking API.\nThe continuation-based API\nis not available in the standard (yet) but is described in the\nConcurrency Technical\nSpecification.
We might want to use a solution based on std::future
but our needs are slightly different from the C++11 futures: they are not suitable for use inside the simulation kernel because they only provide a blocking API (future.get()
) whereas the simulation kernel cannot block.\nInstead, we need a continuation-based API to be used in our event-driven\nsimulation kernel.
The C++ Concurrency TS describes a continuation-based API. Our futures are based on this with a few differences[4]:
- f.wait() and its variants are not meaningful in this context, so there are no blocking wait operations;
- future.get() does an implicit wait. Calling this method in the simulation kernel only makes sense if the future is already ready; if it is not, this would deadlock the simulator, so an error is raised instead;
- the continuations are not executed inside the future.then() or promise.set_value() calls but are scheduled by the simulation kernel[5].

Future
The implementation of future is in simgrid::kernel::Future
and\nsimgrid::kernel::Promise
[6] and is based on the Concurrency\nTS[7]:
The future and the associated promise use a shared state defined with:
\nenum class FutureStatus {\n not_ready,\n ready,\n done,\n};\n\nclass FutureStateBase : private boost::noncopyable {\npublic:\n void schedule(simgrid::xbt::Task<void()>&& job);\n void set_exception(std::exception_ptr exception);\n void set_continuation(simgrid::xbt::Task<void()>&& continuation);\n FutureStatus get_status() const;\n bool is_ready() const;\n // [...]\nprivate:\n FutureStatus status_ = FutureStatus::not_ready;\n std::exception_ptr exception_;\n simgrid::xbt::Task<void()> continuation_;\n};\n\ntemplate<class T>\nclass FutureState : public FutureStateBase {\npublic:\n void set_value(T value);\n T get();\nprivate:\n boost::optional<T> value_;\n};\n\ntemplate<class T>\nclass FutureState<T&> : public FutureStateBase {\n // ...\n};\ntemplate<>\nclass FutureState<void> : public FutureStateBase {\n // ...\n};\n
\nBoth Future
and Promise
have a reference to the shared state:
template<class T>\nclass Future {\n // [...]\nprivate:\n std::shared_ptr<FutureState<T>> state_;\n};\n\ntemplate<class T>\nclass Promise {\n // [...]\nprivate:\n std::shared_ptr<FutureState<T>> state_;\n bool future_get_ = false;\n};\n
\nThe crux of future.then()
is:
template<class T>\ntemplate<class F>\nauto simgrid::kernel::Future<T>::thenNoUnwrap(F continuation)\n-> Future<decltype(continuation(std::move(*this)))>\n{\n typedef decltype(continuation(std::move(*this))) R;\n\n if (state_ == nullptr)\n throw std::future_error(std::future_errc::no_state);\n\n auto state = std::move(state_);\n // Create a new future...\n Promise<R> promise;\n Future<R> future = promise.get_future();\n // ...and when the current future is ready...\n state->set_continuation(simgrid::xbt::makeTask(\n [](Promise<R> promise, std::shared_ptr<FutureState<T>> state,\n F continuation) {\n // ...set the new future value by running the continuation.\n Future<T> future(std::move(state));\n simgrid::xbt::fulfillPromise(promise,[&]{\n return continuation(std::move(future));\n });\n },\n std::move(promise), state, std::move(continuation)));\n return std::move(future);\n}\n
\nWe added a (much simpler) future.then_()
method which does not\ncreate a new future:
template<class T>\ntemplate<class F>\nvoid simgrid::kernel::Future<T>::then_(F continuation)\n{\n if (state_ == nullptr)\n throw std::future_error(std::future_errc::no_state);\n // Give shared-ownership to the continuation:\n auto state = std::move(state_);\n state->set_continuation(simgrid::xbt::makeTask(\n std::move(continuation), state));\n}\n
\nThe .get()
delegates to the shared state. As we mentioned previously, an\nerror is raised if the future is not ready:
template<class T>
T simgrid::kernel::Future<T>::get()
{
  if (state_ == nullptr)
    throw std::future_error(std::future_errc::no_state);
  std::shared_ptr<FutureState<T>> state = std::move(state_);
  return state->get();
}

template<class T>
T simgrid::kernel::FutureState<T>::get()
{
  if (status_ != FutureStatus::ready)
    xbt_die("Deadlock: this future is not ready");
  status_ = FutureStatus::done;
  if (exception_) {
    std::exception_ptr exception = std::move(exception_);
    exception_ = nullptr;
    std::rethrow_exception(std::move(exception));
  }
  xbt_assert(this->value_);
  auto result = std::move(this->value_.get());
  this->value_ = boost::optional<T>();
  return result;
}
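The contract of this non-blocking get() (return the value if ready, raise an error otherwise) can be mimicked in a standalone sketch. ToyFuture is a hypothetical stand-in for the kernel future, not the SimGrid class:

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>
#include <utility>

// Hypothetical stand-in for the kernel future: get() never blocks,
// it either returns the stored value or reports a deadlock.
template<class T>
class ToyFuture {
public:
    void set_value(T v) { value_ = std::move(v); }
    bool is_ready() const { return value_.has_value(); }
    T get()
    {
        if (!value_)
            throw std::runtime_error("Deadlock: this future is not ready");
        T res = std::move(*value_);
        value_.reset();  // like FutureStatus::done: the value is consumed
        return res;
    }
private:
    std::optional<T> value_;
};
```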
Simcalls are not so easy to understand, and adding a new one is not easy either. In order to add a simcall, one first has to add it to the list of simcalls, which looks like this:
\n# This looks like C++ but it is a basic IDL-like language\n# (one definition per line) parsed by a python script:\n\nvoid process_kill(smx_process_t process);\nvoid process_killall(int reset_pid);\nvoid process_cleanup(smx_process_t process) [[nohandler]];\nvoid process_suspend(smx_process_t process) [[block]];\nvoid process_resume(smx_process_t process);\nvoid process_set_host(smx_process_t process, sg_host_t dest);\nint process_is_suspended(smx_process_t process) [[nohandler]];\nint process_join(smx_process_t process, double timeout) [[block]];\nint process_sleep(double duration) [[block]];\n\nsmx_mutex_t mutex_init();\nvoid mutex_lock(smx_mutex_t mutex) [[block]];\nint mutex_trylock(smx_mutex_t mutex);\nvoid mutex_unlock(smx_mutex_t mutex);\n\n[...]\n
\nAt runtime, a simcall is represented by a structure containing a simcall\nnumber and its arguments (among some other things):
\nstruct s_smx_simcall {\n // Simcall number:\n e_smx_simcall_t call;\n // Issuing actor:\n smx_process_t issuer;\n // Arguments of the simcall:\n union u_smx_scalar args[11];\n // Result of the simcall:\n union u_smx_scalar result;\n // Some additional stuff:\n smx_timer_t timer;\n int mc_value;\n};\n
with a scalar union type:
\nunion u_smx_scalar {\n char c;\n short s;\n int i;\n long l;\n long long ll;\n unsigned char uc;\n unsigned short us;\n unsigned int ui;\n unsigned long ul;\n unsigned long long ull;\n double d;\n void* dp;\n FPtr fp;\n};\n
\nThen one has to call (manually \ud83d\ude22) a\nPython script\nwhich generates a bunch of C++ files:
- user-side code which marshals the simcall arguments into struct s_smx_simcall, issues the simcall and unwraps the result;
- kernel-side code which unmarshals the arguments from struct s_smx_simcall and dispatches to the corresponding handler.

Then one has to write the code of the kernel-side handler for the simcall and the code of the simcall itself (which calls the code-generated marshaling/unmarshaling stuff).
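The marshalling performed by the generated code can be sketched as follows. ScalarSketch is a simplified, hypothetical version of u_smx_scalar, not the generated SimGrid code:

```cpp
#include <cassert>

// Each simcall argument is stored into one scalar slot on the user
// side and read back on the kernel side (simplified for illustration):
union ScalarSketch {
    int i;
    double d;
    void* dp;
};

// User side: marshal an argument into a slot.
ScalarSketch marshal_double(double v)
{
    ScalarSketch s;
    s.d = v;
    return s;
}

// Kernel side: unmarshal the argument from the slot.
double unmarshal_double(const ScalarSketch& s)
{
    return s.d;
}
```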
\nIn order to simplify this process, we added two generic simcalls which\ncan be used to execute a function in the simulation kernel context:
\n# This one should really be called run_immediate:\nvoid run_kernel(std::function<void()> const* code) [[nohandler]];\nvoid run_blocking(std::function<void()> const* code) [[block,nohandler]];\n
\nThe first one (simcall_run_kernel()
) executes a function in the simulation\nkernel context and returns immediately (without blocking the actor):
void simcall_run_kernel(std::function<void()> const& code)\n{\n simcall_BODY_run_kernel(&code);\n}\n\ntemplate<class F> inline\nvoid simcall_run_kernel(F& f)\n{\n simcall_run_kernel(std::function<void()>(std::ref(f)));\n}\n
\nOn top of this, we add a wrapper which can be used to return a value of any\ntype and properly handles exceptions:
\ntemplate<class F>\ntypename std::result_of<F()>::type kernelImmediate(F&& code)\n{\n // If we are in the simulation kernel, we take the fast path and\n // execute the code directly without simcall\n // marshalling/unmarshalling/dispatch:\n if (SIMIX_is_maestro())\n return std::forward<F>(code)();\n\n // If we are in the application, pass the code to the simulation\n // kernel which executes it for us and reports the result:\n typedef typename std::result_of<F()>::type R;\n simgrid::xbt::Result<R> result;\n simcall_run_kernel([&]{\n xbt_assert(SIMIX_is_maestro(), \"Not in maestro\");\n simgrid::xbt::fulfillPromise(result, std::forward<F>(code));\n });\n return result.get();\n}\n
\nwhere Result<R>
can store either a R
or an exception.
Example of usage:
\nxbt_dict_t Host::properties() {\n return simgrid::simix::kernelImmediate([&] {\n simgrid::surf::HostImpl* surf_host =\n this->extension<simgrid::surf::HostImpl>();\n return surf_host->getProperties();\n });\n}\n
\nIn this example, the kernelImmediate()
call is not in user code but\nin the framework code. We do not expect the normal user to write\nsimulator kernel code. Those mechanisms are intended to be used by\nthe implementer of the framework in order to implement user\nprimitives.
The second generic simcall (simcall_run_blocking()
) executes a function in\nthe SimGrid simulation kernel immediately but does not wake up the calling actor\nimmediately:
void simcall_run_blocking(std::function<void()> const& code);\n\ntemplate<class F>\nvoid simcall_run_blocking(F& f)\n{\n simcall_run_blocking(std::function<void()>(std::ref(f)));\n}\n
\nThe f
function is expected to set up some callbacks in the simulation kernel which will wake up the actor (with simgrid::simix::unblock(actor)
) when the operation is completed.
This is wrapped in a higher-level primitive as well. The\nkernelSync()
function expects a function-object which is executed\nimmediately in the simulation kernel and returns a Future<T>
. The\nsimulator blocks the actor and resumes it when the Future<T>
becomes\nready with its result:
template<class F>\nauto kernelSync(F code) -> decltype(code().get())\n{\n typedef decltype(code().get()) T;\n if (SIMIX_is_maestro())\n xbt_die(\"Can't execute blocking call in kernel mode\");\n\n smx_process_t self = SIMIX_process_self();\n simgrid::xbt::Result<T> result;\n\n simcall_run_blocking([&result, self, &code]{\n try {\n auto future = code();\n future.then_([&result, self](simgrid::kernel::Future<T> value) {\n // Propagate the result from the future\n // to the simgrid::xbt::Result:\n simgrid::xbt::setPromise(result, value);\n simgrid::simix::unblock(self);\n });\n }\n catch (...) {\n // The code failed immediately. We can wake up the actor\n // immediately with the exception:\n result.set_exception(std::current_exception());\n simgrid::simix::unblock(self);\n }\n });\n\n // Get the result of the operation (which might be an exception):\n return result.get();\n}\n
\nA contrived example of this would be:
\nint res = simgrid::simix::kernelSync([&] {\n return kernel_wait_until(30).then(\n [](simgrid::kernel::Future<void> future) {\n return 42;\n }\n );\n});\n
\nA more realistic example (implementing user-level primitives) would\nbe:
\nsg_size_t File::read(sg_size_t size)\n{\n return simgrid::simix::kernelSync([&] {\n return file_->async_read(size);\n });\n}\n
\nWe can write the related kernelAsync()
which wakes up the actor immediately\nand returns a future to the actor. As this future is used in the actor context,\nit is a different future\n(simgrid::simix::Future
instead of simgrid::kernel::Future
)\nwhich implements a C++11 std::future
wait-based API:
template <class T>\nclass Future {\npublic:\n Future() {}\n Future(simgrid::kernel::Future<T> future) : future_(std::move(future)) {}\n bool valid() const { return future_.valid(); }\n T get();\n bool is_ready() const;\n void wait();\nprivate:\n // We wrap an event-based kernel future:\n simgrid::kernel::Future<T> future_;\n};\n
\nThe future.get()
method is implemented as[8]:
template<class T>\nT simgrid::simix::Future<T>::get()\n{\n if (!valid())\n throw std::future_error(std::future_errc::no_state);\n smx_process_t self = SIMIX_process_self();\n simgrid::xbt::Result<T> result;\n simcall_run_blocking([this, &result, self]{\n try {\n // When the kernel future is ready...\n this->future_.then_(\n [this, &result, self](simgrid::kernel::Future<T> value) {\n // ... wake up the process with the result of the kernel future.\n simgrid::xbt::setPromise(result, value);\n simgrid::simix::unblock(self);\n });\n }\n catch (...) {\n result.set_exception(std::current_exception());\n simgrid::simix::unblock(self);\n }\n });\n return result.get();\n}\n
\nkernelAsync()
simply \ud83d\ude09 calls kernelImmediate()
and wraps the\nsimgrid::kernel::Future
into a simgrid::simix::Future
:
template<class F>\nauto kernelAsync(F code)\n -> Future<decltype(code().get())>\n{\n typedef decltype(code().get()) T;\n\n // Execute the code in the simulation kernel and get the kernel future:\n simgrid::kernel::Future<T> future =\n simgrid::simix::kernelImmediate(std::move(code));\n\n // Wrap the kernel future in a user future:\n return simgrid::simix::Future<T>(std::move(future));\n}\n
\nA contrived example of this would be:
simgrid::simix::Future<int> future = simgrid::simix::kernelAsync([&] {
  return kernel_wait_until(30).then(
    [](simgrid::kernel::Future<void> future) {
      return 42;
    }
  );
});
do_some_stuff();
int res = future.get();
\nA more realistic example (implementing user-level primitives) would\nbe:
\nsimgrid::simix::Future<sg_size_t> File::async_read(sg_size_t size)\n{\n return simgrid::simix::kernelAsync([&] {\n return file_->async_read(size);\n });\n}\n
\nkernelSync()
could be rewritten as:
template<class F>\nauto kernelSync(F code) -> decltype(code().get())\n{\n return kernelAsync(std::move(code)).get();\n}\n
The semantics are equivalent, but this form would require two simcalls instead of one to do the same job (one in kernelAsync()
and one in\n.get()
).
SimGrid uses double
for representing the simulated time.
In contrast, all the C++ APIs use std::chrono::duration and std::chrono::time_point. They are used in:

- std::this_thread::sleep_for() and std::this_thread::sleep_until();
- future.wait_for() and future.wait_until();
- condvar.wait_for() and condvar.wait_until().

We can define future.wait_for(duration) and future.wait_until(timepoint) for our futures, but for better compatibility with standard C++ code we might want to define versions expecting std::chrono::duration and std::chrono::time_point.
For time points, we need to define a clock working in the simulated time (which meets the TrivialClock requirements, see [time.clock.req] in the C++14 standard):
struct SimulationClock {\n using rep = double;\n using period = std::ratio<1>;\n using duration = std::chrono::duration<rep, period>;\n using time_point = std::chrono::time_point<SimulationClock, duration>;\n static constexpr bool is_steady = true;\n static time_point now()\n {\n return time_point(duration(SIMIX_get_clock()));\n }\n};\n
\nA time point in the simulation is a time point using this clock:
\ntemplate<class Duration>\nusing SimulationTimePoint =\n std::chrono::time_point<SimulationClock, Duration>;\n
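Outside the simulator, the same pattern can be exercised with a stand-in clock whose "simulated time" is an ordinary double. FakeSimulationClock, fake_simulated_time and seconds_until are hypothetical names for this sketch (the real clock reads SIMIX_get_clock()):

```cpp
#include <cassert>
#include <chrono>
#include <ratio>

// Stand-in for SIMIX_get_clock(): the simulated time is a plain double.
static double fake_simulated_time = 0.0;

struct FakeSimulationClock {
    using rep = double;
    using period = std::ratio<1>;
    using duration = std::chrono::duration<rep, period>;
    using time_point = std::chrono::time_point<FakeSimulationClock, duration>;
    static constexpr bool is_steady = true;
    static time_point now()
    {
        return time_point(duration(fake_simulated_time));
    }
};

// How many simulated seconds remain until a deadline:
double seconds_until(FakeSimulationClock::time_point deadline)
{
    return (deadline - FakeSimulationClock::now()).count();
}
```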
\nThis is used for example in simgrid::s4u::this_actor::sleep_for()
and\nsimgrid::s4u::this_actor::sleep_until()
:
void sleep_for(double duration)
{
  if (duration > 0)
    simcall_process_sleep(duration);
}

void sleep_until(double timeout)
{
  double now = SIMIX_get_clock();
  if (timeout > now)
    simcall_process_sleep(timeout - now);
}

template<class Rep, class Period>
void sleep_for(std::chrono::duration<Rep, Period> duration)
{
  auto seconds =
    std::chrono::duration_cast<SimulationClock::duration>(duration);
  this_actor::sleep_for(seconds.count());
}

template<class Duration>
void sleep_until(const SimulationTimePoint<Duration>& timeout_time)
{
  auto timeout_native =
    std::chrono::time_point_cast<SimulationClock::duration>(timeout_time);
  this_actor::sleep_until(timeout_native.time_since_epoch().count());
}
\nWhich means it is possible to use (since C++14):
using namespace std::chrono_literals;
simgrid::s4u::this_actor::sleep_for(42s);
\nSimGrid has had a C-based API for mutexes and condition variables for\nsome time. These mutexes are different from the standard\nsystem-level mutex (std::mutex
, pthread_mutex_t
, etc.) because\nthey work at simulation-level. Locking on a simulation mutex does\nnot block the thread directly but makes a simcall\n(simcall_mutex_lock()
) which asks the simulation kernel to wake the calling\nactor when it can get ownership of the mutex. Blocking directly at the\nOS level would deadlock the simulation.
Reusing the C++ standard API for our simulation mutexes has many benefits:

- it is easier for people who already know std::mutex to understand and use SimGrid mutexes;
- we can reuse generic code which works with the standard API (std::unique_lock, std::lock_guard, etc.).

We defined a reference-counted Mutex class for this (which supports the Lockable requirements, see [thread.req.lockable.req] in the C++14 standard):
class Mutex {\n friend ConditionVariable;\nprivate:\n friend simgrid::simix::Mutex;\n simgrid::simix::Mutex* mutex_;\n Mutex(simgrid::simix::Mutex* mutex) : mutex_(mutex) {}\npublic:\n\n friend void intrusive_ptr_add_ref(Mutex* mutex);\n friend void intrusive_ptr_release(Mutex* mutex);\n using Ptr = boost::intrusive_ptr<Mutex>;\n\n // No copy:\n Mutex(Mutex const&) = delete;\n Mutex& operator=(Mutex const&) = delete;\n\n static Ptr createMutex();\n\npublic:\n void lock();\n void unlock();\n bool try_lock();\n};\n
\nThe methods are simply wrappers around existing simcalls:
\nvoid Mutex::lock()\n{\n simcall_mutex_lock(mutex_);\n}\n
\nUsing the same API as std::mutex
(Lockable
) means we can use existing\nC++-standard code such as std::unique_lock<Mutex>
or\nstd::lock_guard<Mutex>
for exception-safe mutex handling[9]:
{\n std::lock_guard<simgrid::s4u::Mutex> lock(*mutex);\n sum += 1;\n}\n
\nSimilarly SimGrid already had simulation-level condition variables\nwhich can be exposed using the same API as std::condition_variable
:
class ConditionVariable {\nprivate:\n friend s_smx_cond;\n smx_cond_t cond_;\n ConditionVariable(smx_cond_t cond) : cond_(cond) {}\npublic:\n\n ConditionVariable(ConditionVariable const&) = delete;\n ConditionVariable& operator=(ConditionVariable const&) = delete;\n\n friend void intrusive_ptr_add_ref(ConditionVariable* cond);\n friend void intrusive_ptr_release(ConditionVariable* cond);\n using Ptr = boost::intrusive_ptr<ConditionVariable>;\n static Ptr createConditionVariable();\n\n void wait(std::unique_lock<Mutex>& lock);\n template<class P>\n void wait(std::unique_lock<Mutex>& lock, P pred);\n\n // Wait functions taking a plain double as time:\n\n std::cv_status wait_until(std::unique_lock<Mutex>& lock,\n double timeout_time);\n std::cv_status wait_for(\n std::unique_lock<Mutex>& lock, double duration);\n template<class P>\n bool wait_until(std::unique_lock<Mutex>& lock,\n double timeout_time, P pred);\n template<class P>\n bool wait_for(std::unique_lock<Mutex>& lock,\n double duration, P pred);\n\n // Wait functions taking a std::chrono time:\n\n template<class Rep, class Period, class P>\n bool wait_for(std::unique_lock<Mutex>& lock,\n std::chrono::duration<Rep, Period> duration, P pred);\n template<class Rep, class Period>\n std::cv_status wait_for(std::unique_lock<Mutex>& lock,\n std::chrono::duration<Rep, Period> duration);\n template<class Duration>\n std::cv_status wait_until(std::unique_lock<Mutex>& lock,\n const SimulationTimePoint<Duration>& timeout_time);\n template<class Duration, class P>\n bool wait_until(std::unique_lock<Mutex>& lock,\n const SimulationTimePoint<Duration>& timeout_time, P pred);\n\n // Notify:\n\n void notify_one();\n void notify_all();\n\n};\n
\nWe currently accept both double
(for simplicity and consistency with\nthe current codebase) and std::chrono
types (for compatibility with\nC++ code) as durations and timepoints. One important thing to notice here is\nthat cond.wait_for()
and cond.wait_until()
work in the simulated time, not in real time.
The simple cond.wait()
and cond.wait_for()
delegate to\npre-existing simcalls:
void ConditionVariable::wait(std::unique_lock<Mutex>& lock)\n{\n simcall_cond_wait(cond_, lock.mutex()->mutex_);\n}\n\nstd::cv_status ConditionVariable::wait_for(\n std::unique_lock<Mutex>& lock, double timeout)\n{\n // The simcall uses -1 for \"any timeout\" but we don't want this:\n if (timeout < 0)\n timeout = 0.0;\n\n try {\n simcall_cond_wait_timeout(cond_, lock.mutex()->mutex_, timeout);\n return std::cv_status::no_timeout;\n }\n catch (xbt_ex& e) {\n\n // If the exception was a timeout, we have to take the lock again:\n if (e.category == timeout_error) {\n try {\n lock.mutex()->lock();\n return std::cv_status::timeout;\n }\n catch (...) {\n std::terminate();\n }\n }\n\n std::terminate();\n }\n catch (...) {\n std::terminate();\n }\n}\n
\nOther methods are simple wrappers around those two:
\ntemplate<class P>\nvoid ConditionVariable::wait(std::unique_lock<Mutex>& lock, P pred)\n{\n while (!pred())\n wait(lock);\n}\n\ntemplate<class P>\nbool ConditionVariable::wait_until(std::unique_lock<Mutex>& lock,\n double timeout_time, P pred)\n{\n while (!pred())\n if (this->wait_until(lock, timeout_time) == std::cv_status::timeout)\n return pred();\n return true;\n}\n\ntemplate<class P>\nbool ConditionVariable::wait_for(std::unique_lock<Mutex>& lock,\n double duration, P pred)\n{\n return this->wait_until(lock,\n SIMIX_get_clock() + duration, std::move(pred));\n}\n
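Since the API mirrors std::condition_variable, standard patterns transfer directly. Here is the classic predicate wait written against the standard types (with the s4u Mutex/ConditionVariable the code would read the same, modulo the simulated clock; predicate_wait_demo is a name made up for this sketch):

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Classic predicate wait: because the value is already produced here,
// wait() returns immediately; otherwise another thread (or, in SimGrid,
// another actor) would set it and call notify_one().
int predicate_wait_demo()
{
    std::mutex mutex;
    std::condition_variable cond;
    int value = 42;  // already produced, so the predicate holds at once

    std::unique_lock<std::mutex> lock(mutex);
    cond.wait(lock, [&] { return value != 0; });
    return value;
}
```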
We wrote two future implementations based on the std::future API:

- a continuation-based (future.then(stuff)) future used inside our (non-blocking, event-based) simulation kernel;
- a blocking (future.get()) future used in the actors, which waits using a simcall.

These futures are used to implement kernelSync() and kernelAsync() which expose asynchronous operations in the simulation kernel to the actors.
In addition, we wrote variations of some other C++ standard library\nclasses (SimulationClock
, Mutex
, ConditionVariable
) which work in\nthe simulation:
Reusing the same API as the C++ standard library is very useful because:

- the API is already familiar to C++ developers;
- we can reuse generic code which works with the standard API (std::unique_lock, std::lock_guard, etc.).

This type of approach might be useful for other libraries which define their own contexts. An example of this is Mordor, an I/O library using fibers (cooperative scheduling): it implements cooperative/fiber mutexes and recursive mutexes which are compatible with the BasicLockable requirements (see [thread.req.lockable.basic] in the C++14 standard).
Result
Result is like a mix of std::future
and std::promise
in a single object, without shared state or synchronisation:
template<class T>\nclass Result {\n enum class ResultStatus {\n invalid,\n value,\n exception,\n };\npublic:\n Result();\n ~Result();\n Result(Result const& that);\n Result& operator=(Result const& that);\n Result(Result&& that);\n Result& operator=(Result&& that);\n bool is_valid() const;\n void reset();\n void set_exception(std::exception_ptr e);\n void set_value(T&& value);\n void set_value(T const& value);\n T get();\nprivate:\n ResultStatus status_ = ResultStatus::invalid;\n union {\n T value_;\n std::exception_ptr exception_;\n };\n};\n
These helpers are useful for dealing with generic future-based code:
\ntemplate<class R, class F>\nauto fulfillPromise(R& promise, F&& code)\n-> decltype(promise.set_value(code()))\n{\n try {\n promise.set_value(std::forward<F>(code)());\n }\n catch(...) {\n promise.set_exception(std::current_exception());\n }\n}\n\ntemplate<class P, class F>\nauto fulfillPromise(P& promise, F&& code)\n-> decltype(promise.set_value())\n{\n try {\n std::forward<F>(code)();\n promise.set_value();\n }\n catch(...) {\n promise.set_exception(std::current_exception());\n }\n}\n\ntemplate<class P, class F>\nvoid setPromise(P& promise, F&& future)\n{\n fulfillPromise(promise, [&]{ return std::forward<F>(future).get(); });\n}\n
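Because it only relies on set_value()/set_exception(), fulfillPromise works with std::promise as well. This standalone check (the non-void overload is reproduced from above so the snippet compiles on its own; exception_is_propagated is a name made up for this sketch) shows an exception travelling through the promise:

```cpp
#include <cassert>
#include <future>
#include <stdexcept>
#include <utility>

// Non-void overload of fulfillPromise, reproduced from the text:
template<class P, class F>
auto fulfillPromise(P& promise, F&& code)
-> decltype(promise.set_value(code()))
{
    try {
        promise.set_value(std::forward<F>(code)());
    }
    catch (...) {
        promise.set_exception(std::current_exception());
    }
}

bool exception_is_propagated()
{
    std::promise<int> promise;
    std::future<int> future = promise.get_future();
    fulfillPromise(promise, []() -> int { throw std::runtime_error("boom"); });
    try {
        future.get();  // rethrows the exception stored in the promise
    }
    catch (std::runtime_error&) {
        return true;
    }
    return false;
}
```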
Task<R(F...)> is a type-erased callable object similar to std::function<R(F...)> but which works for move-only types. It is similar to std::packaged_task<R(F...)> but does not wrap the result in a std::future<R> (it is not packaged).
|                | std::function  | std::packaged_task | simgrid::xbt::Task |
|----------------|----------------|--------------------|--------------------|
| Copyable       | Yes            | No                 | No                 |
| Movable        | Yes            | Yes                | Yes                |
| Call           | const          | non-const          | non-const          |
| Callable       | multiple times | once               | once               |
| Sets a promise | No             | Yes                | No                 |
It could be implemented as:
template<class T>
class Task {
private:
  std::packaged_task<T> task_;
public:

  template<class F>
  Task(F f) :
    task_(std::forward<F>(f))
  {}

  template<class... ArgTypes>
  auto operator()(ArgTypes... args)
  -> decltype(task_.get_future().get())
  {
    task_(std::forward<ArgTypes>(args)...);
    return task_.get_future().get();
  }

};
but we don't need a shared state.
\nThis is useful in order to bind move-only type arguments:
\ntemplate<class F, class... Args>\nclass TaskImpl {\nprivate:\n F code_;\n std::tuple<Args...> args_;\n typedef decltype(simgrid::xbt::apply(\n std::move(code_), std::move(args_))) result_type;\npublic:\n TaskImpl(F code, std::tuple<Args...> args) :\n code_(std::move(code)),\n args_(std::move(args))\n {}\n result_type operator()()\n {\n // simgrid::xbt::apply is C++17 std::apply:\n return simgrid::xbt::apply(std::move(code_), std::move(args_));\n }\n};\n\ntemplate<class F, class... Args>\nauto makeTask(F code, Args... args)\n-> Task< decltype(code(std::move(args)...))() >\n{\n TaskImpl<F, Args...> task(\n std::move(code), std::make_tuple(std::move(args)...));\n return std::move(task);\n}\n
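The problem makeTask solves (binding move-only arguments into a callable) can be seen with std::unique_ptr: a capturing lambda holds it fine, but the result cannot go into std::function because std::function requires a copyable target, which is exactly why Task exists. consume_bound_pointer is a name made up for this sketch:

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Binding a move-only argument into a callable (C++14 init-capture):
int consume_bound_pointer()
{
    auto ptr = std::make_unique<int>(42);
    auto task = [p = std::move(ptr)] { return *p; };
    // std::function<int()> f = std::move(task); // would NOT compile:
    // std::function needs a copyable target, and this lambda is move-only.
    return task();
}
```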
Update (2018-08-15): there is a proposal for including this as std::unique_function in the C++ standard. In addition to the implementations listed in the paper, there are also folly::Function and stlab::task. There is a later proposal for extending std::function to non-copyable move-only types and one-shot calls with e.g. std::function<void()&&>.
Update (2023-07-08):\nC++23 features std::move_only_function<R(...)>
\nwhich is similar.\nIn contrast to xbt::Task<R(...)>
,\nstd::move_only_function<R(...)>
can be called multiple times.
The relationship between the SimGrid simulation kernel and the simulated actors is similar to the relationship between an OS kernel and the OS processes: the simulation kernel manages (schedules) the execution of the actors; the actors make requests to the simulation kernel using simcalls. However, both the simulation kernel and the actors currently run in the same OS process (and use the same address space). ↩︎
\nThere is an interesting library implementation in\nRust as well. \u21a9\ufe0e
\nThis is the kind of futures that are available in ECMAScript which use\nthe same kind of never-blocking asynchronous model as our discrete event\nsimulator. \u21a9\ufe0e
\n(which are related to the fact that we are in a non-blocking single-threaded\nsimulation engine) \u21a9\ufe0e
\nCalling the continuations from simulation loop means that we don't have\nto fear problems like invariants not being restored when the callbacks\nare called \ud83d\ude28 or stack overflows triggered by deeply nested\ncontinuations chains \ud83d\ude30. The continuations are all called in a\nnice and predictable place in the simulator with a nice and predictable\nstate \ud83d\ude0c. \u21a9\ufe0e
\nIn the C++ standard library, std::future<T>
is used by the consumer\nof the result. On the other hand, std::promise<T>
is used by the\nproducer of the result. The consumer calls promise.set_value(42)
\nor promise.set_exception(e)
in order to set the result which will\nbe made available to the consumer by future.get()
. \u21a9\ufe0e
Currently, we have not implemented some features such as shared futures. ↩︎
\nYou might want to compare this method with simgrid::kernel::Future::get()
\nwe showed previously: the method of the kernel future does not block and\nraises an error if the future is not ready; the method of the actor future\nblocks after having set a continuation to wake the actor when the future\nis ready. \u21a9\ufe0e
std::lock() might kind of work too, but it may not be such a good idea to use it as it may rely on a deadlock avoidance algorithm such as try-and-back-off. A backoff would probably uselessly wait in real time instead of simulated time. The deadlock avoidance algorithm might also add non-determinism in the simulation, which we would like to avoid. std::try_lock() should be safe to use though. ↩︎
FlameGraph is used to display stack trace samples, but we can use it for other purposes as well.
For example, we can quite simply display how the lines of code of a project are distributed across files:
\ncloc --csv-delimiter=\"$(printf '\\t')\" --by-file --quiet --csv src/ include/ |\nsed '1,2d' |\ncut -f 2,5 |\nsed 's/\\//;/g' |\n./flamegraph.pl\n
\n\nRR is a very useful tool for debugging. It\ncan record the execution of a program and then replay the exact same\nexecution at will inside a debugger. One very useful extra power\navailable since 4.0 is the support for efficient reverse\nexecution\nwhich can be used to find the root cause of a bug in your program\nby rewinding time. In this example, we reverse-execute a program from a\ncase of use-after-free in order to find where the block of memory was\nfreed.
\n$ rr record ./foo my_args\n$ rr replay\n(rr) continue\n(rr) break free if $rdi == some_address\n(rr) reverse-continue\n
\nWe have a case of use-after-free:
\n$ gdb --args java -classpath \"$classpath\" surfCpuModel/TestCpuModel \\\n small_platform.xml surfCpuModelDeployment.xml \\\n --cfg=host/model:compound\n\n(gdb) run\n[\u2026]\n\nProgram received signal SIGSEGV, Segmentation fault.\n[Switching to Thread 0x7ffff7fbb700 (LWP 12766)]\n0x00007fffe4fe3fb7 in xbt_dynar_map (dynar=0x7ffff0276ea0, op=0x56295a443b6c65) at /home/gabriel/simgrid/src/xbt/dynar.c:603\n603\t op(elm);\n\n(gdb) p *dynar\n$2 = {size = 2949444837771837443, used = 3415824664728436765,\n elmsize = 3414970357536090483, data = 0x646f4d2f66727573,\n free_f = 0x56295a443b6c65}\n
The fields of this structure are all wrong and we suspect that this block of heap memory was already freed and reused by another allocation.
\nWe could use GDB with a conditional breakpoint on free(ptr)
with\nptr == dynar
but this approach poses a few problems:\nthe address of the block may change from one execution to another (unless address space randomisation is disabled, e.g. with setarch -R);\nthe breakpoint on free() for this specific address may trigger for previous allocations before we reach the correct one.\nRR can be used to create a recording of a given execution of the program. This execution can then be replayed exactly inside a debugger. This fixes our first problem.
\nLet's record our crash in RR:
\n$ rr record java -classpath \"$classpath\" surfCpuModel/TestCpuModel \\\n small_platform.xml surfCpuModelDeployment.xml \\\n --cfg=host/model:compound\n[\u2026]\n# A fatal error has been detected by the Java Runtime Environment:\n[\u2026]\n
\nNow we can replay the exact same execution over and over again in a special GDB session:
\n$ rr replay\n(rr) continue\nContinuing.\n[\u2026]\n\nProgram received signal SIGSEGV, Segmentation fault.\n[Switching to Thread 12601.12602]\n0x00007fe94761efb7 in xbt_dynar_map (dynar=0x7fe96c24f350, op=0x56295a443b6c65) at /home/gabriel/simgrid/src/xbt/dynar.c:603\n603\t op(elm);\n
\nWe want to know who freed this block of memory. RR 4.0 provides\nsupport for efficient reverse-execution which can be used to solve our\nsecond problem.
\nLet's set a conditional breakpoint on free()
:
(rr) p dynar\n$1 = (const xbt_dynar_t) 0x7fe96c24f350\n\n(rr) break free if $rdi == 0x7fe96c24f350\n
\nNote: This is for x86_64.\nIn the x86_64 ABI,\nthe RDI
register is used to pass the first parameter.
Now we can use RR's superpowers by reverse-executing the program until we find who freed this block of memory:
\n\n(rr) reverse-continue\nContinuing.\nProgram received signal SIGSEGV, Segmentation fault.\n[\u2026]\n\n(rr) reverse-continue\nContinuing.\nBreakpoint 1, __GI___libc_free (mem=0x7fe96c24f350) at malloc.c:2917\n2917\tmalloc.c: Aucun fichier ou dossier de ce type.\n\n(bt) backtrace\n#0 __GI___libc_free (mem=0x7fe96c24f350) at malloc.c:2917\n#1 0x00007fe96b18486d in ZIP_FreeEntry (jz=0x7fe96c0f43d0, ze=0x7fe96c24f6e0) at ../../../src/share/native/java/util/zip/zip_util.c:1104\n#2 0x00007fe968191d78 in ?? ()\n#3 0x00007fe96818dcbb in ?? ()\n#4 0x0000000000000002 in ?? ()\n#5 0x00007fe96c24f6e0 in ?? ()\n#6 0x000000077ab0c2d8 in ?? ()\n#7 0x00007fe970641a80 in ?? ()\n#8 0x0000000000000000 in ?? ()\n\n(rr) reverse-continue\nContinuing.\nBreakpoint 1, __GI___libc_free (mem=0x7fe96c24f350) at malloc.c:2917\n2917\tin malloc.c\n\n(rr) backtrace\n#0 __GI___libc_free (mem=0x7fe96c24f350) at malloc.c:2917\n#1 0x00007fe94761f28e in xbt_dynar_to_array (dynar=0x7fe96c24f350) at /home/gabriel/simgrid/src/xbt/dynar.c:691\n#2 0x00007fe946b98a2f in SwigDirector_CpuModel::createCpu (this=0x7fe96c14d850, name=0x7fe96c156862 \"Tremblay\", power_peak=0x7fe96c24f350, pstate=0, \n power_scale=1, power_trace=0x0, core=1, state_initial=SURF_RESOURCE_ON, state_trace=0x0, cpu_properties=0x0)\n at /home/gabriel/simgrid/src/bindings/java/org/simgrid/surf/surfJAVA_wrap.cxx:1571\n#3 0x00007fe947531615 in cpu_parse_init (host=0x7fe9706456d0) at /home/gabriel/simgrid/src/surf/cpu_interface.cpp:44\n#4 0x00007fe947593f88 in sg_platf_new_host (h=0x7fe9706456d0) at /home/gabriel/simgrid/src/surf/sg_platf.c:138\n#5 0x00007fe9475e54fb in ETag_surfxml_host () at /home/gabriel/simgrid/src/surf/surfxml_parse.c:481\n#6 0x00007fe9475da1dc in surf_parse_lex () at src/surf/simgrid_dtd.c:7093\n#7 0x00007fe9475e84f2 in _surf_parse () at /home/gabriel/simgrid/src/surf/surfxml_parse.c:1068\n#8 0x00007fe9475e8cfa in parse_platform_file (file=0x7fe96c14f1e0 
\"/home/gabriel/simgrid/examples/java/../platforms/small_platform.xml\")\n at /home/gabriel/simgrid/src/surf/surfxml_parseplatf.c:172\n#9 0x00007fe9475142f4 in SIMIX_create_environment (file=0x7fe96c14f1e0 \"/home/gabriel/simgrid/examples/java/../platforms/small_platform.xml\")\n at /home/gabriel/simgrid/src/simix/smx_environment.c:39\n#10 0x00007fe9474cd98f in MSG_create_environment (file=0x7fe96c14f1e0 \"/home/gabriel/simgrid/examples/java/../platforms/small_platform.xml\")\n at /home/gabriel/simgrid/src/msg/msg_environment.c:37\n#11 0x00007fe94686c473 in Java_org_simgrid_msg_Msg_createEnvironment (env=0x7fe96c00a1d8, cls=0x7fe9706459a8, jplatformFile=0x7fe9706459b8)\n at /home/gabriel/simgrid/src/bindings/java/jmsg.c:203\n#12 0x00007fe968191d78 in ?? ()\n#13 0x00000007fffffffe in ?? ()\n#14 0x00007fe970645958 in ?? ()\n#15 0x00000007f5cd1100 in ?? ()\n#16 0x00007fe9706459b8 in ?? ()\n#17 0x00000007f5cd1738 in ?? ()\n#18 0x0000000000000000 in ?? ()\n\n
Now that we have found the offending free()
call we can inspect the state\nof the program:
\n(rr) frame 1\n#1 0x00007fe94761f28e in xbt_dynar_to_array (dynar=0x7fe96c24f350) at /home/gabriel/simgrid/src/xbt/dynar.c:691\n691\t free(dynar);\n\n(rr) list\n686\t{\n687\t void *res;\n688\t xbt_dynar_shrink(dynar, 1);\n689\t memset(xbt_dynar_push_ptr(dynar), 0, dynar->elmsize);\n690\t res = dynar->data;\n691\t free(dynar);\n692\t return res;\n693\t}\n694\n695\t/** @brief Compare two dynars\n\n
If necessary we could continue reverse-executing in order to understand\nbetter what caused the problem.
\nWhile GDB has builtin support for reverse execution, doing the same thing in GDB is much slower. Moreover, recording the execution fills the GDB record buffer quite rapidly, which prevents us from recording a large execution: with the native support of GDB we would probably need to narrow down the region where the bug appeared in order to record (and then reverse-execute) only a small part of the execution of the program.
\nIn my previous SimGrid post, I talked about different solutions for a better isolation between the model-checked application and the model-checker. We chose to avoid the (hacky) solution based on multiple dynamic-linker namespaces in the same process and use a more conventional process-based isolation.
\nIn the previous version of SimGridMC, the model-checker was running in the same process as the main SimGrid application. We had in the same process:
\nall the simulated processes (containing the local state of each\nprocess);
\nthe SimGrid simulator (containing the shared/global state such as\nthe state of the communications);
\nthe model-checker (containing the state of the exploration in the\nexecution graph of the simulated application) which had to\ncheckpoint/restore the state of the other components (but not its\nown state).
\nIn order to do this, the SimGridMC process was using two different\nmalloc()
-heaps in the same process in order to separate:
the state of the simulated application (processes states and global\nstate);
\nthe state of the model-checker.
\nThe model-checker code had a lot of code to select which heap had to\nbe active (and used by malloc()
and friends) at a given point of the\ncode.
This is an example of a function with a lot of heap management calls\n(the lines managing the heap swapping are commented with <*>
):
void MC_pre_modelcheck_safety()\n{\n\n int mc_mem_set = (mmalloc_get_current_heap() == mc_heap); // <*>\n\n mc_state_t initial_state = NULL;\n smx_process_t process;\n\n /* Create the initial state and push it into the exploration stack */\n if (!mc_mem_set) // <*>\n MC_SET_MC_HEAP; // <*>\n\n if (_sg_mc_visited > 0)\n visited_states = xbt_dynar_new(sizeof(mc_visited_state_t),\n visited_state_free_voidp);\n\n initial_state = MC_state_new();\n\n MC_SET_STD_HEAP; // <*>\n\n /* Wait for requests (schedules processes) */\n MC_wait_for_requests();\n\n MC_SET_MC_HEAP; // <*>\n\n /* Get an enabled process and insert it in the interleave set\n of the initial state */\n xbt_swag_foreach(process, simix_global->process_list) {\n if (MC_process_is_enabled(process)) {\n MC_state_interleave_process(initial_state, process);\n if (mc_reduce_kind != e_mc_reduce_none)\n break;\n }\n }\n\n xbt_fifo_unshift(mc_stack, initial_state);\n\n if (!mc_mem_set) // <*>\n MC_SET_STD_HEAP; // <*>\n}\n
\nThe heap management code was cumbersome and difficult to maintain: it was necessary to know which function had to be called in each context, which function was selecting the correct heap, and to select the current heap accordingly. It was moreover necessary to know which data was allocated in which heap. Failing to use the correct heap could lead to errors such as:
\nfree()
because the memory was not malloc()
-ed with the current heap.\nWhile this design was interesting for the performance of the model-checker, it was quite difficult to maintain and understand. We wanted to create a new version of the model-checker which would be simpler to understand and maintain:
\nIn order to avoid the coexistence of the two heaps we envisioned two\npossible solutions:
\nWhile the dynamic-linker based solution is quite interesting and would\nprovide better performance by avoiding context switches (and who\ndoesn't want to write their own dynamic linker?), it would probably be\ndifficult to achieve and would probably not make the code easier to\nunderstand.
\nWe chose to use the much more standard solution of using different\nprocesses which is conceptually much simpler and provides a better\nisolation between the model-checker and the model-checked application.\nWith this design, the model-checker is a quite standard process: all\ndebugging tools can be used without any problem (Valgrind, GDB) on the\nmodel-checker process. The model-checked process is not completely\nstandard as we are constantly overwriting its state but we can still\nptrace it and use a debugger.
\nUpdate (2016-04-01): the model-checker now ptrace
s the\nmodel-checked application (for various reasons) and it is not possible\nto debug the model-checked application anymore. However, we have a\nfeature to replay an execution of the model-checked application\noutside of the model-checker.
In this new design, the model-checker process behaves somewhat like a debugger for the simulated (model-checked) application by monitoring and controlling its execution. The model-checker process is responsible for:
\nThe simulated application is responsible for:
\nTwo mechanisms are used to implement the interaction between the\nmodel-checker process and the model-checked application:
\n/proc/$pid/mem
(this is used for snapshot/restore and in order to look at the state of the model-checked application).\nSince Linux 3.2, it is possible to read from and write to another process's virtual memory without ptrace()
-ing it: I took care not to use ptrace()
in order to be able to use it for another purpose (a process can only be ptraced by a single process at a time).
The split has been done in two phases:
\nThe model-checker process and the model-checked application communicate with each other over a UNIX datagram socket. This socket is created by the model-checker and passed to the child model-checked process.
\nThis is used in the initialisation:
\nThis is used at runtime to control the execution of the model-checked application:
\nThe (simplified) client-loop looks like this:
\nvoid MC_client_main_loop(void)\n{\n while (1) {\n message_type message;\n receive_message(&message);\n switch (message.type()) {\n\n // Execute a simcall:\n case MC_MESSAGE_SIMCALL_HANDLE:\n execute_transition(message.transition());\n send_message(MC_MESSAGE_WAITING);\n break;\n\n // Execute application code until a visible simcall is reached:\n case MC_MESSAGE_CONTINUE:\n execute_application_code();\n send_message(MC_MESSAGE_WAITING);\n break;\n\n // [...] (Other messages here)\n }\n }\n}\n
\nEach model-checking algorithm (safety, liveness, communication\ndeterminism) is implemented as model-checker side code which triggers\nexecution of model-checked-side transitions with:
\n// Execute a simcall (MC_MESSAGE_SIMCALL_HANDLE):\nMC_simcall_handle(req, value);\n\n// Execute simulated application code (MC_MESSAGE_CONTINUE):\nMC_wait_for_requests();\n
\nThe communication determinism algorithm needs to see the result of\nsome simcalls before triggering the application code:
\nMC_simcall_handle(req, value);\nMC_handle_comm_pattern(call, req, value, communication_pattern, 0);\nMC_wait_for_requests();\n
\nSnapshot and restoration is handled by reading/writing the model-checked process memory with /proc/$pid/mem
. During this\noperation, the model-checked process is waiting for messages on a\nspecial stack dedicated to the simulator (which is not managed by the\nsnapshotting logic). During this time, the model-checked application\nis not supposed to be accessing the simulated application memory.\nWhen this is finished, the model-checker wakes up the simulated\napplication with the MC_MESSAGE_SIMCALL_HANDLE
and\nMC_MESSAGE_CONTINUE
.
The model-checker needs to read some of the state of the simulator (state of the communications, names of the processes and so on). Currently this is handled quite brutally by reading the data directly from the structures of the model-checked process (following linked-list items, array elements, etc. in the remote process):
\n// Read the hostname from the MCed process:\nprocess->read_bytes(&host_copy, sizeof(host_copy), remote(p->host));\nint len = host_copy.key_len + 1;\nchar hostname[len];\nprocess->read_bytes(hostname, len, remote(host_copy.key));\ninfo->hostname = mc_model_checker->get_host_name(hostname);\n
\nThis is quite ugly and should probably be replaced by some more\nstructured way to share this information in the future.
\nWe now have a simgrid-mc
executable for the model-checker process.\nIt must be called explicitly by the user in order to use the\nmodel-checker (similarly to gdb
or other debugging tools):
# Running the raw application:\n./bugged1\n\n# Running the application in GDB:\ngdb --args ./bugged1\n\n# Running the application in valgrind:\nvalgrind ./bugged1\n\n# Running the application in SimgridMC:\nsimgrid-mc ./bugged1\n
\nFor SMPI applications, the -wrapper
argument of smpirun
must be\nused:
# Running the raw application:\nsmpirun \\\n -hostfile hostfile -platform platform.xml \\\n --cfg=maxmin/precision:1e-9 --cfg=network/model:SMPI \\\n --cfg=network/TCP_gamma:4194304 \\\n -np 4 --cfg=smpi/send_is_detached_thres:0 --cfg=smpi/coll_selector:mpich \\\n --cfg=contexts/factory:ucontext --cfg=contexts/stack_size:4 \\\n ./dup\n\n# Running the application in GDB:\nsmpirun -wrapper \"gdb --args\" \\\n -hostfile hostfile -platform platform.xml \\\n --cfg=maxmin/precision:1e-9 --cfg=network/model:SMPI \\\n --cfg=network/TCP_gamma:4194304 \\\n -np 4 --cfg=smpi/send_is_detached_thres:0 --cfg=smpi/coll_selector:mpich \\\n --cfg=contexts/factory:ucontext --cfg=contexts/stack_size:4 \\\n ./dup\n\n# Running the application in valgrind:\nsmpirun -wrapper \"valgrind\" \\\n -hostfile hostfile -platform platform.xml \\\n --cfg=maxmin/precision:1e-9 --cfg=network/model:SMPI \\\n --cfg=network/TCP_gamma:4194304 \\\n -np 4 --cfg=smpi/send_is_detached_thres:0 --cfg=smpi/coll_selector:mpich \\\n --cfg=contexts/factory:ucontext --cfg=contexts/stack_size:4 \\\n ./dup\n\n# Running the application in SimgridMC:\nsmpirun -wrapper \"simgrid-mc\" \\\n -hostfile hostfile -platform platform.xml \\\n --cfg=maxmin/precision:1e-9 --cfg=network/model:SMPI \\\n --cfg=network/TCP_gamma:4194304 \\\n -np 4 --cfg=smpi/send_is_detached_thres:0 --cfg=smpi/coll_selector:mpich \\\n --cfg=contexts/factory:ucontext --cfg=contexts/stack_size:4 \\\n ./dup\n
\nUnder the hood, simgrid-mc
sets a few environment variables for its child process:
SIMGRID_MC
in order to enable the model-checker support in the\nchild/simulated-application process (this triggers the usage of the\ncustom heap for example);SIMGRID_MC_SOCKET_FD
contains the number of the file descriptor\nused to pass the UNIX datagram socket;LD_BIND_NOW
in order to avoid lazy relocations.After implementing the separate mode, the single process mode has been\nremoved in order to have a cleaner code. In order to have the two\nmode of operations coexist, many functions were checking the mode\noperation and the behaviour was changing depending on the mode. Most\nof this code has been removed and is now much simpler.
\nThe code managing the two heaps is now useless and has been completely\nremoved. We are still using our custom heap implementation in the\nmodel-checked application however: we are using its internal\nrepresentation to track the different allocations in the heap; it is\nused as well in order to clear the bytes of an allocation before\ngiving it to the application. The model-checked application however\nis a quite standard application and uses the standard system heap\nimplementation (or could use another implementation) which is expected\nto have better performance than our implementation.
\nCurrently, it is not quite clear which part of the API are intended to\nbe used by the model-checked process, which part are to be used by the\nmodel-checker process and which parts can be used by both parts. Some\neffort has been used to separate the different parts of the API (by\nmoving them in different header files) but this is is still an ongoing\nprocess. In the future, we might want to have a better organisation\nusing different header files, namespaces and possibly different\nshared-objects for the different parts of the API.
\nA longer term goal, would be to have a nice API for the model-checker\nwhich could easily be used by the users to write their own\nmodel-checker algorithms (in their own executables). We might even\nwant to export a Lua based binding to write the model-checker\nalgorithms.
\nIn parallel, the model-checker code has been ported to C++ and a part\nof it has been rewritten in a more idiomatic C++:
\nvoid*
/XBT-based\ncontainers;simgrid::mc
namespace.\nAll the MC code has been converted to C++ but the conversion to idiomatic C++ is still ongoing: some parts of the code are still using C idioms.
\nThis first version is quite a bit slower than the previous one. This was expected: the new implementation uses cross-process communications, and the old version had been heavily optimised. It might however be optimised in the future in order to minimise the overhead of the cross-process synchronisations.
\nThis is a first step towards a cleaner and simpler SimGridMC. The heap-juggling code has been removed. However, we now have some code which reads directly from the data structures of the other process: this code is neither nice nor very maintainable, and we will probably want to find a better way to do this.
\nSome things still need to be done:
\nIn an attempt to simplify the development of the SimGrid model-checker, we were thinking about moving the model-checker out into a different process. Another approach would be to use dynamic-linker isolation of the different components of the process. Here is a summary of the goals, problems and design issues surrounding these topics.
\nThe design of the SimGrid simulator is based on the design of an operating system.
\nIn a typical OS, we have a kernel managing a global state and several userspace processes running on top of the kernel. The kernel schedules the execution of the different processes (and their threads) on the available CPUs. The kernel provides an API to the processes made of several system calls.
\n\nSimGrid simulates a distributed system: it simulates a network and lets the different processes of the simulated system use this simulated network. Each simulated process runs on top of the SimGrid kernel. The SimGrid kernel schedules the execution of the different processes on the available OS threads. The SimGrid kernel provides an API to the processes made of several simulation calls.
\n\nIn order to reduce the cost of context switching between the different processes, in the current implementation of SimGrid all the simulated processes and the SimGrid kernel live in the same OS process: there is no MMU-enforced separation of memory between the simulated processes, but they are expected to communicate with each other only through the means provided by the SimGrid kernel (the simulation calls) and should not share mutable memory.
\n\nThe SimGrid kernel has a dedicated stack and each simulated process has its\nown stack: cooperative multitasking (fibers, ucontext
) is used to\nswitch between the different contexts (SimGrid kernel/process) and is\nused by the SimGrid kernel to schedule the execution of the different\nprocesses.
The same (libc
) heap is shared between the SimGrid kernel and the\nsimulated processes.
The SimGrid model-checker is a dynamic analysis component for SimGrid.\nIt explores the different possible interleavings of execution of the\nsimulated processes (depending on the execution of their transitions\ni.e. the different possible orderings of their communications).
\nIn order to do this, the MC saves the state of the system at each node of the graph of possible executions:
\nThose states are then used to:
\nIn the current implementation, the model-checker lives in the same\nprocess as the main SimGrid process (the SimGrid kernel and the\nprocesses):
\n\nHowever, the model-checker needs to maintain its own\nstate: the state of the model-checker must not be saved, compared and\nrestored with the rest of the state.
\nIn order to do this, the state of the model-checker is maintained in a\nsecond heap:
\nThis is implemented by overriding the malloc()
, free()
and friends in order to support multiple heaps. A global variable is used to choose the current working heap:
// Simplified code\nxbt_mheap_t __mmalloc_current_heap = NULL;\n\nvoid *malloc(size_t n)\n{\n return mmalloc(__mmalloc_current_heap, n);\n}\n\nvoid free(void *ptr)\n{\n return mfree(__mmalloc_current_heap, ptr);\n}\n
\nThe current implementation is complicated and not easy to understand and\nmaintain:
\nstatic
variable as a cache in the model-checker, sharing data between the model-checker and the simulated process.
implementation. This implementation is probably not as\nefficient as a more modern malloc()
implementation: for example,\nit does not use a per-thread arena or any sort of thread-friendly\napproach but a single mutex per heap. Avoiding to have multiple\nheaps per process would remove a dependency on the malloc()
\nimplementation (we would still have a dependency on the mmalloc()
\nmetadata format) and would make it easier to switch to another\nmalloc()
implementation.\nA first motivation for modifying the architecture of SimGridMC is to increase the maintainability of the SimGridMC codebase.
\nAnother related goal is to simplify the debugging experience (of the simulated\napplication, the SimGrid kernel and the model-checker). For example, the current\nversion of SimGridMC does not work under valgrind. A solution which would\nprovide a more powerful debugging experience would be a valuable tool for the\nSimGridMC devs but more importantly for the users of SimGridMC.
\nFor all these reasons, we would like to move the model-checker into a separate process: a model-checker process maintains the model-checker state and controls the execution of a model-checked process.
\n\nThe snapshoting/restoration of the model-checked process memory can be\ndone using /proc/${pid}/mem
or process_vm_readv()
and\nprocess_vm_writev()
.
As long as the OS threads are living on stacks which are not managed\nby the state snapshot/restoration mechanism, they will not be\naffected: we must take care that the OS threads switch to unmanaged\nstacks when we are doing the state snapshots/restorations.
\nAnother solution would be to use ptrace()
with PTRACE_GETREGSET
\nand PTRACE_SETREGSET
in order to snapshot/restore the registers of\neach thread but we would like to avoid this in order to be able to use\nptrace()
for debugging or other\npurposes.
Linux does not provide a way to change the file descriptors of another process: the restoration of the file descriptors must be done in the target OS process and cannot be done from the model-checker process. Cooperation of the model-checked process is needed for the file descriptor restoration.
\nWe could abuse ptrace()
-based syscall rewriting techniques or some\nsort of parasite injection in order to\nachieve this.
Another idea would be to create a custom dynamic linker with namespace\nsupport in order to be able to link multiple instances of the same\nlibrary and provide isolation between different parts of the process.
\nThis could be used to:
\nmmap()
based SMPI\nprivatisation technique;libc
of the model-checker and the main application\nin order to have a cleaner separation of the two heaps;ptrace()
-based system call\ninterception.dlmopen()
It turns out that\nDCE\n(Direct Code Execution)\nalready uses a similar approach to load multiples application instances along\nwith Linux kernel implementations (and its network stack)\non top of the NS3 packet level network simulator\nin the same process:\nthe applications and Linux kernel are compiled as shared objects, the latter\nforming a Library OS liblinux.so
shared object\nand loaded multiple times in the same process alongside with the NS3 instance.
Among several alternative\nstrategies,\nDCE uses the dlmopen()
\nfunction. This is a variant of\ndlopen()
originating from SunOS/Solaris and implemented in the GNU userland which allows loading dynamic libraries in separate namespaces:
An alternative implementation of the ld.so
dynamic linker,\nelf-loader
, is used which\nprovides additional\nfeatures:
More information about dlmopen()
\ncan be found in old version of Sun\nLinkers and Libraries Guide.
libc
However, I was envisioning something slightly different: instead of\nwriting a replacement of ld.so
(using raw system calls), I was\nthinking about building the custom dynamic linker on top of libc
and\nlibdl
in order to be able to use libc
(malloc()
), libdl
and\nlibelf
instead of using the raw system calls.
In a split process design, the model-checker could be a quite standard\napplication avoiding weird hacks (such as introspection with /proc/self/maps
and\nDWARF, snapshoting/restoration of the state with memcpy()
, custom mmalloc()
\nimplementation with multiple heaps). Once a relevant trajectory of the\nmodel-checked application has been identified, it could be replayed outside of\nthe model-checker and debugged in this simpler mode.
However, having a single process could lead to a better debugging experience: one could combine breakpoints in the model-checker, the SimGrid kernel and the simulated application with conditions spanning all those components.
\nAt the same time, using multiple dynamic-linking namespaces could make the debugging experience more complicated. I am not sure how well this is supported by the different available debugging tools. The DCE tools seem to show that it is reasonably well supported by GDB and valgrind.
\nSo we have two possible directions:
\nelf-loader
;libc
.\n(This is quite a large undertaking but being able to use shared libraries\ncould simplify its code base.)The first solution provides a better isolation of the model-checker.\nThe second solution is closer to the current implementation and\nshould have better performances by avoiding the context switches and\nIPC in favour of direct memory access and function calls. Moreover, the\ndynamic-linker-based isolation could be reused for other parts of the\nprojects (such as the isolation of the simulated MPI processes).
\nIt is not clear which solution would provide the better debugging experience for\nthe user and which solution would be better for the maintainability of\nSimGridMC.
\ndlmopen()
quick demoThis simple program creates three new namespaces and loads libpthread
in those\nnamespaces:
#define _GNU_SOURCE\n#include <dlfcn.h>\n\n#include <unistd.h>\n\nint main(int argc, const char** argv)\n{\n size_t i;\n for (i=0; i!=3; ++i) {\n void* x = dlmopen(LM_ID_NEWLM, \"libpthread.so.0\", RTLD_NOW);\n if (!x)\n return 1;\n }\n while(1) sleep(200000);\n return 0;\n}\n
\nWe see that libpthread
is loaded thrice. Each instance has its own libc
\ninstance as well (and a fourth one is loaded for the main program):
\n00400000-00401000 r-xp 00000000 08:06 7603474 /home/myself/temp/a.out\n00600000-00601000 rw-p 00000000 08:06 7603474 /home/myself/temp/a.out\n0173a000-0175b000 rw-p 00000000 00:00 0 [heap]\n7fca7ac7d000-7fca7ae1c000 r-xp 00000000 08:01 2626010 /lib/x86_64-linux-gnu/libc-2.19.so\n7fca7ae1c000-7fca7b01c000 ---p 0019f000 08:01 2626010 /lib/x86_64-linux-gnu/libc-2.19.so\n7fca7b01c000-7fca7b020000 r--p 0019f000 08:01 2626010 /lib/x86_64-linux-gnu/libc-2.19.so\n7fca7b020000-7fca7b022000 rw-p 001a3000 08:01 2626010 /lib/x86_64-linux-gnu/libc-2.19.so\n7fca7b022000-7fca7b026000 rw-p 00000000 00:00 0\n7fca7b026000-7fca7b03e000 r-xp 00000000 08:01 2625992 /lib/x86_64-linux-gnu/libpthread-2.19.so\n7fca7b03e000-7fca7b23d000 ---p 00018000 08:01 2625992 /lib/x86_64-linux-gnu/libpthread-2.19.so\n7fca7b23d000-7fca7b23e000 r--p 00017000 08:01 2625992 /lib/x86_64-linux-gnu/libpthread-2.19.so\n7fca7b23e000-7fca7b23f000 rw-p 00018000 08:01 2625992 /lib/x86_64-linux-gnu/libpthread-2.19.so\n7fca7b23f000-7fca7b243000 rw-p 00000000 00:00 0\n7fca7b243000-7fca7b3e2000 r-xp 00000000 08:01 2626010 /lib/x86_64-linux-gnu/libc-2.19.so\n7fca7b3e2000-7fca7b5e2000 ---p 0019f000 08:01 2626010 /lib/x86_64-linux-gnu/libc-2.19.so\n7fca7b5e2000-7fca7b5e6000 r--p 0019f000 08:01 2626010 /lib/x86_64-linux-gnu/libc-2.19.so\n7fca7b5e6000-7fca7b5e8000 rw-p 001a3000 08:01 2626010 /lib/x86_64-linux-gnu/libc-2.19.so\n7fca7b5e8000-7fca7b5ec000 rw-p 00000000 00:00 0\n7fca7b5ec000-7fca7b604000 r-xp 00000000 08:01 2625992 /lib/x86_64-linux-gnu/libpthread-2.19.so\n7fca7b604000-7fca7b803000 ---p 00018000 08:01 2625992 /lib/x86_64-linux-gnu/libpthread-2.19.so\n7fca7b803000-7fca7b804000 r--p 00017000 08:01 2625992 /lib/x86_64-linux-gnu/libpthread-2.19.so\n7fca7b804000-7fca7b805000 rw-p 00018000 08:01 2625992 /lib/x86_64-linux-gnu/libpthread-2.19.so\n7fca7b805000-7fca7b809000 rw-p 00000000 00:00 0\n7fca7b809000-7fca7b9a8000 r-xp 00000000 08:01 2626010 /lib/x86_64-linux-gnu/libc-2.19.so\n7fca7b9a8000-7fca7bba8000 ---p 
0019f000 08:01 2626010 /lib/x86_64-linux-gnu/libc-2.19.so\n7fca7bba8000-7fca7bbac000 r--p 0019f000 08:01 2626010 /lib/x86_64-linux-gnu/libc-2.19.so\n7fca7bbac000-7fca7bbae000 rw-p 001a3000 08:01 2626010 /lib/x86_64-linux-gnu/libc-2.19.so\n7fca7bbae000-7fca7bbb2000 rw-p 00000000 00:00 0\n7fca7bbb2000-7fca7bbca000 r-xp 00000000 08:01 2625992 /lib/x86_64-linux-gnu/libpthread-2.19.so\n7fca7bbca000-7fca7bdc9000 ---p 00018000 08:01 2625992 /lib/x86_64-linux-gnu/libpthread-2.19.so\n7fca7bdc9000-7fca7bdca000 r--p 00017000 08:01 2625992 /lib/x86_64-linux-gnu/libpthread-2.19.so\n7fca7bdca000-7fca7bdcb000 rw-p 00018000 08:01 2625992 /lib/x86_64-linux-gnu/libpthread-2.19.so\n7fca7bdcb000-7fca7bdcf000 rw-p 00000000 00:00 0\n7fca7bdcf000-7fca7bf6e000 r-xp 00000000 08:01 2626010 /lib/x86_64-linux-gnu/libc-2.19.so\n7fca7bf6e000-7fca7c16e000 ---p 0019f000 08:01 2626010 /lib/x86_64-linux-gnu/libc-2.19.so\n7fca7c16e000-7fca7c172000 r--p 0019f000 08:01 2626010 /lib/x86_64-linux-gnu/libc-2.19.so\n7fca7c172000-7fca7c174000 rw-p 001a3000 08:01 2626010 /lib/x86_64-linux-gnu/libc-2.19.so\n7fca7c174000-7fca7c178000 rw-p 00000000 00:00 0\n7fca7c178000-7fca7c17b000 r-xp 00000000 08:01 2626017 /lib/x86_64-linux-gnu/libdl-2.19.so\n7fca7c17b000-7fca7c37a000 ---p 00003000 08:01 2626017 /lib/x86_64-linux-gnu/libdl-2.19.so\n7fca7c37a000-7fca7c37b000 r--p 00002000 08:01 2626017 /lib/x86_64-linux-gnu/libdl-2.19.so\n7fca7c37b000-7fca7c37c000 rw-p 00003000 08:01 2626017 /lib/x86_64-linux-gnu/libdl-2.19.so\n7fca7c37c000-7fca7c39c000 r-xp 00000000 08:01 2625993 /lib/x86_64-linux-gnu/ld-2.19.so\n7fca7c568000-7fca7c56b000 rw-p 00000000 00:00 0\n7fca7c59a000-7fca7c59c000 rw-p 00000000 00:00 0\n7fca7c59c000-7fca7c59d000 r--p 00020000 08:01 2625993 /lib/x86_64-linux-gnu/ld-2.19.so\n7fca7c59d000-7fca7c59e000 rw-p 00021000 08:01 2625993 /lib/x86_64-linux-gnu/ld-2.19.so\n7fca7c59e000-7fca7c59f000 rw-p 00000000 00:00 0\n7fffa8481000-7fffa84a2000 rw-p 00000000 00:00 0 [stack]\n7fffa85f5000-7fffa85f7000 r-xp 
00000000 00:00 0 [vdso]\n7fffa85f7000-7fffa85f9000 r--p 00000000 00:00 0 [vvar]\nffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]\n\n
The new namespaces are probably not fully functional in this state:\nthere are probably conflicts to resolve between the different instances. For example,\neach libc
probably tries to manage the same heap with sbrk()
.
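Entries of this dump can be inspected programmatically; here is a minimal sketch (not from the original post) that parses a /proc/&lt;pid&gt;/maps line such as the ones above:

```python
# Parse one /proc/<pid>/maps line into its fields.
import re

MAPS_LINE = re.compile(
    r"^([0-9a-f]+)-([0-9a-f]+)\s+(\S{4})\s+([0-9a-f]+)\s+(\S+)\s+(\d+)\s*(.*)$"
)

def parse_maps_line(line):
    """Split a maps line into start/end addresses, permissions and pathname."""
    m = MAPS_LINE.match(line)
    if m is None:
        raise ValueError("not a maps line: %r" % line)
    return {
        "start": int(m.group(1), 16),
        "end": int(m.group(2), 16),
        "perms": m.group(3),
        "offset": int(m.group(4), 16),
        "pathname": m.group(7) or None,   # None for anonymous mappings
    }

line = ("7fca7b243000-7fca7b3e2000 r-xp 00000000 08:01 2626010 "
        "/lib/x86_64-linux-gnu/libc-2.19.so")
entry = parse_maps_line(line)
assert entry["perms"] == "r-xp"
assert entry["end"] - entry["start"] == 0x19f000
```

Grouping such entries by pathname makes the duplicated libc/libpthread instances stand out immediately.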
In two previous posts, I looked into cleaning the stack frame of a\nfunction before using it by adding assembly at the beginning of each\nfunction. This was done either by modifying LLVM with a custom\ncodegen pass or by\nrewriting the\nassembly\nbetween the compiler and the assembler. The current implementation\nadds a loop at the beginning of every function. We look at the impact\nof this modification on the performance of the application.
\nUpdate: this is an updated version of the post with fixed\ncode and updated results (the original version of the code was\nbroken).
\nHere are the initial results:
\nTest | \nNormal | \nStack cleaning | \n
---|---|---|
ctest (complete testsuite) | \n348.06s | \n387.53s | \n
ctest -R mc-bugged1-liveness-visited-ucontext-sparse | \n1.53s | \n2.00s | \n
run_test comm dup 4 | \n42.54s | \n127.80s | \n
On big problems, the overhead of the stack-cleaning modification\nbecomes significant.
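To make the overhead concrete, the slowdown factors can be computed from the table above (a quick sanity check, not part of the original benchmarks):

```python
# Times in seconds, taken from the table above: (normal, stack cleaning).
results = {
    "ctest (complete testsuite)": (348.06, 387.53),
    "ctest -R mc-bugged1-liveness-visited-ucontext-sparse": (1.53, 2.00),
    "run_test comm dup 4": (42.54, 127.80),
}

for test, (normal, cleaning) in results.items():
    # The bigger the problem, the worse the relative overhead.
    print("%s: %.2fx slowdown" % (test, cleaning / normal))
```

The complete test suite only slows down by about 11%, but the larger `comm dup` run is roughly three times slower.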
\nWe would like to avoid the overhead of the stack-cleaning code. In order\nto do this we can use the following facts:
\nThus, we can disable stack-cleaning if we detect that we are not\nexecuting the application code. This can be implemented in two ways:
\n%rsp
).In order to evaluate the efficiency of this approach, we use a simple\ncomparison of %rsp
with a constant value:
\tmovq $0x7fff00000000, %r11\n\tcmpq %r11, %rsp\n\tjae .Lstack_cleaner_done0\n\tmovabsq $3, %r11\n.Lstack_cleaner_loop0:\n\tmovq $0, -32(%rsp,%r11,8)\n\tsubq $1, %r11\n\tjne .Lstack_cleaner_loop0\n.Lstack_cleaner_done0:\n\t# Main code of the function goes here\n
\nThe value is hardcoded in this prototype but it could be loaded from a\nglobal variable instead.
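The guard can be read as follows (a Python sketch of the same predicate; the threshold is the constant hardcoded in the assembly above):

```python
THRESHOLD = 0x7fff00000000  # hardcoded constant from the assembly prototype

def should_clean(rsp):
    # The `jae .Lstack_cleaner_done0` branch skips the cleaning loop when
    # %rsp is at or above the threshold, i.e. when we are running on the
    # (high-address) process stack rather than on a simulated actor stack.
    return rsp < THRESHOLD

assert should_clean(0x7f0000001000)        # low address: simulated stack
assert not should_clean(0x7fffa8490000)    # high address: process stack
```

This works because the simulated stacks are heap-allocated and therefore live at much lower addresses than the process stack.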
\nHere are the results with this optimisation:
\nTest | \nNormal | \nStack cleaning | \n
---|---|---|
ctest (complete testsuite) | \n348.06s | \n372.95s | \n
ctest -R mc-bugged1-liveness-visited-ucontext-sparse | \n1.53s | \n1.53s | \n
run_test comm dup 4 | \n42.54s | \n36.68s | \n
Those results were generated with:
\nMAKEFLAGS=\"-j$(nproc)\"\n\ngit clone https://gforge.inria.fr/git/simgrid/simgrid.git\ncd simgrid\ngit checkout cd84ed2b393b564f5d8bfdaae60b814f81f24dc4\nsimgrid=\"$(pwd)\"\n\nmkdir build-normal\ncd build-normal\ncmake .. -Denable_model-checking=ON -Denable_documentation=OFF \\\n -Denable_compile_warnings=ON -Denable_smpi_MPICH3_testsuite=ON\nmake $MAKEFLAGS\ncd ..\n\nmkdir build-zero\ncd build-zero\ncmake .. -Denable_model-checking=ON -Denable_documentation=OFF \\\n -Denable_compile_warnings=ON -Denable_smpi_MPICH3_testsuite=ON \\\n -DCMAKE_C_COMPILER=\"$simgrid/tools/stack-cleaner/cc\" \\\n -DCMAKE_CXX_COMPILER=\"$simgrid/tools/stack-cleaner/c++\" \\\n -DGFORTRAN_EXE=\"$simgrid/tools/stack-cleaner/fortran\"\nmake $MAKEFLAGS\ncd ..\n\nrun_test() {\n (\n platform=$(find $simgrid -name small_platform_with_routers.xml)\n hostfile=$(find $simgrid | grep mpich3-test/hostfile$)\n\n local base\n base=$(pwd)\n cd $base/teshsuite/smpi/mpich3-test/$1/\n\n $base/bin/smpirun -hostfile $hostfile -platform $platform \\\n --cfg=maxmin/precision:1e-9 --cfg=network/model:SMPI \\\n --cfg=network/TCP_gamma:4194304 \\\n -np $3 --cfg=model-check:1 \\\n --cfg=smpi/send_is_detached_thres:0 --cfg=smpi/coll_selector:mpich \\\n --cfg=contexts/factory:ucontext --cfg=model-check/max_depth:100000 \\\n --cfg=model-check/reduction:none --cfg=model-check/visited:100000 \\\n --cfg=contexts/stack_size:4 --cfg=model-check/sparse-checkpoint:yes \\\n --cfg=model-check/soft-dirty:no ./$2 > /dev/null\n )\n}\n
\nThe results without the optimisation are obtained by removing the\nrelevant assembly from the clean-stack-filter
script.
In order to help the SimGridMC state comparison code, I wrote a\nproof-of-concept LLVM pass which cleans each stack\nframe before using\nit. However, SimGridMC currently does not work properly when compiled\nwith clang/LLVM. We can do the same thing by pre-processing the\nassembly generated by the compiler before passing it to the assembler:\nthis is done by inserting a script between the compiler and the\nassembler. This script rewrites the generated assembly by\nprepending stack-cleaning code at the beginning of each function.
\nIn a typical compilation process, the compiler (here cc1
) reads the\ninput source file and generates assembly. This assembly is then passed\nto the assembler (as
) which generates native binary code:
cat foo.c | cc1 | as > foo.o\n# \u2191 \u2191 \u2191\n# Source Assembly Native\n
\nWe can achieve our goal without depending on LLVM by adding a simple\nassembly-rewriting script to this pipeline between the compiler\nand the assembler:
\ncat foo.c | cc1 | clean-stack-filter | as > foo.o\n# \u2191 \u2191 \u2191 \u2191\n# Source Assembly Assembly Native\n
\nBy doing this, our modification can be used for any compiler as long\nas it sends assembly to an external assembler instead of generating\nthe native binary code directly.
\nThis will be done in three components:
\nclean-stack-filter
);as
) wrapper which calls the assembly rewriting\nscript before delegating to the real assembler;cc
) which calls the real compiler program and\nconfigures it to call our assembler wrapper.The first step is to write a simple UNIX program which takes the\nassembly code of a source file as input and outputs it with a\nstack-cleaning pre-prologue added to each function.
\nHere is the generated assembly for the test function of the previous\nepisode (compiled with GCC):
\nmain:\n.LFB0:\n\t.cfi_startproc\n\tsubq\t$40, %rsp\n\t.cfi_def_cfa_offset 48\n\tmovl\t%edi, 12(%rsp)\n\tmovq\t%rsi, (%rsp)\n\tmovl\t$42, 28(%rsp)\n\tmovl\t$0, %eax\n\tcall\tf\n\tmovl\t$0, %eax\n\taddq\t$40, %rsp\n\t.cfi_def_cfa_offset 8\n\tret\n\t.cfi_endproc\n
\nWe can use .cfi_startproc
to find the beginning of a function and\neach pushq
and subq $x, %rsp
instruction to estimate the stack\nsize used by this function (excluding the red zone and alloca()
as\npreviously). Each time we see the beginning of a function, we buffer\neach line until we are ready to emit the stack-cleaning code.
#!/usr/bin/perl -w\n# Transform assembly in order to clean each stack frame for X86_64.\n\nuse strict;\n$SIG{__WARN__} = sub { die @_ };\n\n# Whether we are still scanning the content of a function:\nour $scanproc = 0;\n\n# Save lines of the function:\nour $lines = \"\";\n\n# Size of the stack for this function:\nour $size = 0;\n\n# Counter for assigning unique ids to labels:\nour $id=0;\n\nsub emit_code {\n my $qsize = $size / 8;\n my $offset = - $size - 8;\n\n if($size != 0) {\n print(\"\\tmovabsq \\$$qsize, %r11\\n\");\n print(\".Lstack_cleaner_loop$id:\\n\");\n print(\"\\tmovq \\$0, $offset(%rsp,%r11,8)\\n\");\n print(\"\\tsubq \\$1, %r11\\n\");\n print(\"\\tjne .Lstack_cleaner_loop$id\\n\");\n }\n\n print $lines;\n\n $id = $id + 1;\n $size = 0;\n $lines = \"\";\n $scanproc = 0;\n}\n\nwhile (<>) {\n if ($scanproc) {\n $lines = $lines . $_;\n if (m/^[ \\t]*\\.cfi_endproc$/) {\n\t emit_code();\n } elsif (m/^[ \\t]*pushq/) {\n\t $size += 8;\n } elsif (m/^[ \\t]*subq[ \\t]*\\$([0-9]+),[ \\t]*%rsp$/) {\n my $val = $1;\n $val = oct($val) if $val =~ /^0/;\n $size += $val;\n emit_code();\n }\n } elsif (m/^[ \\t]*\\.cfi_startproc$/) {\n print $_;\n\n $scanproc = 1;\n } else {\n print $_;\n }\n}\n
\nThis is used as:
\n# Use either of:\nclean-stack-filter < helloworld.s\ngcc -o- -S helloworld.c | clean-stack-filter | gcc -x assembler -r -o helloworld -\n
\nAnd this produces:
\nmain:\n.LFB0:\n\t.cfi_startproc\n\tmovabsq $5, %r11\n.Lstack_cleaner_loop0:\n\tmovq $0, -48(%rsp,%r11,8)\n\tsubq $1, %r11\n\tjne .Lstack_cleaner_loop0\n\tsubq\t$40, %rsp\n\t.cfi_def_cfa_offset 48\n\tmovl\t%edi, 12(%rsp)\n\tmovq\t%rsi, (%rsp)\n\tmovl\t$42, 28(%rsp)\n\tmovl\t$0, %eax\n\tcall\tf\n\tmovl\t$0, %eax\n\taddq\t$40, %rsp\n\t.cfi_def_cfa_offset 8\n\tret\n\t.cfi_endproc\n
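As a sanity check (not in the original post), the emitted parameters can be recomputed from the 40-byte frame seen by the filter (one subq $40, no pushq):

```python
def cleaning_parameters(size):
    """Mirror of the filter's computation: loop count and base offset."""
    qsize = size // 8        # number of quadwords to zero
    offset = -size - 8       # base offset of the indexed addressing mode
    return qsize, offset

qsize, offset = cleaning_parameters(40)
assert (qsize, offset) == (5, -48)   # movabsq $5 and -48(%rsp,%r11,8) above

# Offsets zeroed by the loop for %r11 = qsize .. 1:
zeroed = [offset + 8 * r11 for r11 in range(qsize, 0, -1)]
assert zeroed == [-8, -16, -24, -32, -40]  # exactly the future 40-byte frame
```

The loop therefore zeroes precisely the memory that the subsequent `subq $40, %rsp` will turn into the stack frame.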
\nA second step is to write an extended assembler as
program which\naccepts an extra argument --filter my_shell_command
. We could\nhardcode the filtering script in this wrapper but a generic assembler\nwrapper might be reused somewhere else.
We need to:
\ninterpret a part of the as
command-line arguments and our extra\nargument;
apply the specified filter on the input assembly;
\npass the resulting assembly to the real assembler.
\n#!/usr/bin/ruby\n# Wrapper around the real `as` which adds filtering capabilities.\n\nrequire \"tempfile\"\nrequire \"fileutils\"\n\ndef wrapped_as(argv)\n\n args=[]\n input=nil\n as=\"as\"\n filter=\"cat\"\n\n i = 0\n while i<argv.size\n case argv[i]\n \n when \"--as\"\n as = argv[i+1]\n i = i + 1\n when \"--filter\"\n filter = argv[i+1]\n i = i + 1\n\n when \"-o\", \"-I\"\n args.push(argv[i])\n args.push(argv[i+1])\n i = i + 1\n when /^-/\n args.push(argv[i])\n else\n if input\n exit 1\n else\n input = argv[i]\n end\n end\n i = i + 1\n end\n\n if input==nil\n # We dont handle pipe yet:\n exit 1\n end\n\n # Generate temp file\n tempfile = Tempfile.new(\"as-filter\")\n unless system(filter, 0 => input, 1 => tempfile)\n status=$?.exitstatus\n FileUtils.rm tempfile\n exit status\n end\n args.push(tempfile.path)\n\n # Call the real assembler:\n res = system(as, *args)\n status = if res != nil\n $?.exitstatus\n else\n 1\n end\n FileUtils.rm tempfile\n exit status\n \nend\n\nwrapped_as(ARGV)\n
\nThis is used like this:
\ntools/as --filter \"sed s/world/abcde/\" helloworld.s\n
\nWe now can ask the compiler to use our assembler wrapper instead of\nthe real system assembler:
\n-B
switch prepends a directory to the list of directories used\nto find subprograms such as as
;-no-integrated-as
flag forces the compiler to pass\nthe generated assembly to an external assembler instead of\ngenerating native binary code directly.gcc -B tools/ -Wa,--filter,'sed s/world/abcde/' \\\n helloworld.c -o helloworld-modified-gcc\n
\nclang -no-integrated-as -B tools/ -Wa,--filter,'sed s/world/abcde/' \\\n helloworld.c -o helloworld-modified-clang\n
\nWhich produces:
\n$ ./helloworld\nHello world!\n$ ./helloworld-modified-gcc\nHello abcde!\n$ ./helloworld-modified-clang\nHello abcde!\n
\nBy combining the two tools, we can get a compiler with stack-cleaning enabled:
\ngcc -B tools/ -Wa,--filter,'stack-cleaning-filter' \\\n helloworld.c -o helloworld\n
\nNow we can write compiler wrappers which do this job automatically:
\n#!/bin/sh\npath=$(dirname $0)\nexec gcc -B $path -Wa,--filter,\"$path\"/clean-stack-filter \"$@\"\n
\n#!/bin/sh\npath=$(dirname $0)\nexec g++ -B $path -Wa,--filter,\"$path\"/clean-stack-filter \"$@\"\n
\nWarning
\nAs the assembly modification is implemented in as
,\nthis compiler wrapper will output the unmodified assembly when using\ncc -S
which may be surprising. You need to objdump
the .o
file in\norder to see the effect of the filter.
The whole test suite of SimGrid with model-checking works with this\nimplementation. The next step is to see the impact of this\nmodification on the state comparison of SimGridMC.
\n"}, {"id": "http://www.gabriel.urdhr.fr/2014/10/06/cleaning-the-stack-in-a-llvm-pass/", "title": "Cleaning the stack in a LLVM pass", "url": "https://www.gabriel.urdhr.fr/2014/10/06/cleaning-the-stack-in-a-llvm-pass/", "date_published": "2014-10-06T10:00:02+02:00", "date_modified": "2014-10-06T10:00:02+02:00", "tags": ["computer", "simgrid", "llvm", "compilation", "assembly", "x86_64"], "content_html": "In the previous episode, we implemented a LLVM pass which does\nnothing. Now we are trying to modify\nthis to create a (proof-of-concept) LLVM pass which fills the current\nstack frame with zero before using it.
\n\nThe top (in fact the bottom) of the stack is stored in the %rsp
\nregister: a push
operation decrements the value of %rsp
and store\nthe value in the resulting address; conversely a pop
operation\nincrements the value of %rsp
. Stack variables are allocated by\ndecrementing %rsp
.
A function call (call
) pushes the current value of the instruction\n(%rip
) pointer on the stack. A return instruction (ret
) pops a\nvalue from the stack into %rip
.
A typical call frame contains in order:
\nFor example this C code,
\nint f();\n\nint main(int argc, char** argv) {\n int i = 42;\n f();\n return 0;\n}\n
\nis compiled (with clang -S -fomit-frame-poiner example.c
) into this\n(using AT&T\nsyntax):
main:\n\tsubq\t$24, %rsp\n\tmovl\t$0, 20(%rsp)\n\tmovl\t%edi, 16(%rsp)\n\tmovq\t%rsi, 8(%rsp)\n\tmovl\t$42, 4(%rsp)\n\tmovb\t$0, %al\n\tcallq\tf\n\tmovl\t$0, %edi\n\tmovl\t%eax, (%rsp)\n\tmovl\t%edi, %eax\n\taddq\t$24, %rsp\n\tret\n
\nMemory is allocated on the stack using subq
. Local variables are\nusually referenced by offsets from the stack pointer, OFFSET(%rsp)
.
The x86 (32 bit) ABI uses the %rbp
as the base of the stack. This is\nnot mandatory in the x86-64\nABI but the\ncompiler might still use a frame pointer. The base of the stack frame\nin stored in %rbp
.
Here is the same program compiled with -fno-omit-frame-pointer
:
main:\n\tpushq\t%rbp\n\tmovq\t%rsp, %rbp\n\tsubq\t$32, %rsp\n\tmovl\t$0, -4(%rbp)\n\tmovl\t%edi, -8(%rbp)\n\tmovq\t%rsi, -16(%rbp)\n\tmovl\t$42, -20(%rbp)\n\tmovb\t$0, %al\n\tcallq\tf\n\tmovl\t$0, %edi\n\tmovl\t%eax, -24(%rbp)\n\tmovl\t%edi, %eax\n\taddq\t$32, %rsp\n\tpopq\t%rbp\n\tret\n
\nWhen a frame pointer is used, stack memory is usually referenced as\nfixed offset from %rsp
: OFFSET(%rsp)
.
The x86 32-bit ABI did not allow the code of the function to use\nvariables after the top of the stack: a signal handler could at any\nmoment use any memory after the top of the stack.
\nThe standard x86-64\nABI allows the\ncode of the current function to use the 128 bytes (the red zone) after\nthe top the stack. A signal handler must be instantiated by the OS\nafter the red zone. The red zone can be used for temporary variables\nor for local variables for leaf functions (functions which do not call\nother functions).
\n\nNote: Windows systems do not use the standard x86-64 ABI: the\nusage of the register is different and there is no red zone.
\nLet's make main()
a leaf function:
int main(int argc, char** argv) {\n int i = 42;\n return 0;\n}\n
\nThe variables are allocated in the red zone (negative offsets from the\nstack pointer):
\nmain:\n movl $0, %eax\n movl $0, -4(%rsp)\n movl %edi, -8(%rsp)\n movq %rsi, -16(%rsp)\n movl $42, -20(%rsp)\n ret\n
\nHere is the code we are going to add at the beginning of each\nfunction:
\n\tmovq $QSIZE, %r11\n.Lloop:\n movq $0, OFFSET(%rsp,%r11,8)\n subq $1, %r11\n jne .Lloop\n
\nfor some suitable values of QSIZE and OFFSET.
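The suitable values can be derived directly from the frame size (a sketch of the computation performed by the pass):

```python
WORD = 8  # sizeof(uint64_t)

def cleaning_parameters(stack_size):
    # Zero QSIZE quadwords, starting one word below the future stack frame.
    qsize = stack_size // WORD
    offset = -stack_size - WORD
    return qsize, offset

# For the main() above, which allocates 24 bytes (subq $24, %rsp):
assert cleaning_parameters(24) == (3, -32)
```

These are exactly the `movabsq $3` and `-32(%rsp,%r11,8)` values that appear in the generated assembly for the test code.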
\nThe %r11
is defined by the System V x86-64 ABI (as well as the\nWindows ABI) as a scratchpad register: at the beginning of the\nfunction we are free to use it without saving it first.
This is implemented by a StackCleaner
machine pass whose\nrunOnMachineFunction()
works similarly to the NopInserter
pass.
We compute the parameters of the generate native code from the size of\nthe stack frame:
\nfn.getFrameInfo()->getStackSize()
is the size of the stack used\nby this function (excluding the red zone);X86FrameLowering.cpp
) and SimGridMC does not analyse the stack of\nleaf functions (we would just have to add 128 to size
in order to\nclean up the red zone as well);alloca()
) are not counted here.int size = fn.getFrameInfo()->getStackSize();\nint qsize = size / sizeof(uint64_t);\nif (size==0) {\n // No stack to clean, we do not modify the function:\n return false;\n}\nint offset = - size - sizeof(uint64_t);\n
\nFor LLVM, a functions is represented as a collection\nof basic\nblocks. A basic block is a sequence of instructions where:
\nOur assembly snippet is made of two basic blocks:
\nMachineBasicBlock* bb0 = fn.begin();\nMachineBasicBlock* bb1 = fn.CreateMachineBasicBlock();\nMachineBasicBlock* bb2 = fn.CreateMachineBasicBlock();\n\nfn.push_front(bb2);\nfn.push_front(bb1);\n
\nA functions is a Control Flow Graph of basic blocks. We need to\ncomplete the arcs in this graph:
\nbb1->addSuccessor(bb1);\nbb2->addSuccessor(bb2);\nbb2->addSuccessor(bb0);\n
\nWe generate the machine instructions:
\n// First basic block (initialisation):\n\n// movq $QSIZE, %r11\nllvm::BuildMI(*bb1, bb1->end(), llvm::DebugLoc(), TII.get(llvm::X86::MOV64ri),\n X86::R11).addImm(qsize);\n\n// Second basic block (.Lloop):\n\n// movq $0, OFFSET(%rsp,%r11,8)\nllvm::BuildMI(*bb2, bb2->end(), llvm::DebugLoc(), TII.get(llvm::X86::MOV64mi32))\n .addReg(X86::RSP).addImm(8).addReg(X86::R11).addImm(offset).addReg(0)\n .addImm(0);\n\n// subq $1, %r11\nllvm::BuildMI(*bb2, bb2->end(), llvm::DebugLoc(), TII.get(llvm::X86::SUB64ri8),\n X86::R11)\n .addReg(X86::R11)\n .addImm(1);\n\n// jne .Lloop\nllvm::BuildMI(*bb2, bb2->end(), llvm::DebugLoc(), TII.get(llvm::X86::JNE_4))\n .addMBB(bb2);\n
\nThe instructions have suffix on the argument size and types:
\n64
for instructions working on 64-bit values;r
for register;i
for immediate;i
for memory.The function has been modified:
\nreturn true;\n
\nHere is the generated assembly for our test code:
\nmain:\n\tmovabsq\t$3, %r11\n.LBB0_1:\n\tmovq\t$0, -32(%rsp,%r11,8)\n\tsubq\t$1, %r11\n\tjne\t.LBB0_1\n\tsubq\t$24, %rsp\n\tmovl\t$0, 20(%rsp)\n\tmovl\t%edi, 16(%rsp)\n\tmovq\t%rsi, 8(%rsp)\n\tmovl\t$42, 4(%rsp)\n\tmovb\t$0, %al\n\tcallq\tf\n\tmovl\t$0, %edi\n\tmovl\t%eax, (%rsp)\n\tmovl\t%edi, %eax\n\taddq\t$24, %rsp\n\tretq\n
\nHere is a simple test program using unitialized stack variables:
\n#include <stdio.h>\n\nvoid f() {\n int i;\n int data[16];\n\n for(i=0; i!=16; ++i)\n printf(\"%i \", data[i]);\n printf(\"\\n\");\n\n for(i=0; i!=16; ++i)\n data[i] = i;\n}\n\nvoid g() {\n int i, j, k, l, m, n, o, p;\n printf(\"%i %i %i %i %i %i %i %i\\n\", i, j, k, l, m, n, o, p);\n}\n\nint main(int argc, char** argv) {\n f();\n f();\n g();\n return 0;\n}\n
\nThis is the output of a normal compilation:
\n-1 0 -812203224 32767 -406470232 32655 -400476992 32655 -400465496 32655 0 0 1 0 4195997 0\n0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15\n16 0 0 15774463 15 14 13 12\n\n
And with our stack-cleaning clang:
\n0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n0 0 0 0 0 0 0 0\n\n
The whole SimGrid test suite works when compiled without SimGridMC\nsupport.
\nAt this point, I discovered that SimGrid fails to run when compiled\nwith clang (or DragonEgg) with support for SimGridMC. I need to fix\nthis first before testing the impact of cleaning the stack on\nSimGridMC state comparison.
\nIn the next episode, I'll try another implementation of the same\nconcept using a few scripts in order to process the generated\nassembly between the compiler and the\nassembler\nwhich should work with a standard GCC and with SimGridMC.
\nThe SimGrid model checker uses memory introspection (of the heap,\nstack and global variables) in order to detect the equality of the\nstate of a distributed application at the different nodes of its\nexecution graph. One difficulty is to deal with uninitialised\nvariables. The uninitialised global variables are usually not a big\nproblem as their initial value is 0. The heap variables are dealt with\nby memset
ing to 0 the content of the buffers returned by malloc
\nand friends. The case of uninitialised stack variables is more\nproblematic as their value is whatever was at this place on the stack\nbefore. In order to evaluate the impact of those uninitialised\nvariables, we would like to clean each stack frame before it is\nused. This could be done with a LLVM plugin. Here is my first attempt\nto write a LLVM pass to modify the code of a function.
A solution for this, would be to include, at compilation time,\ninstructions to clean the stack frame at the beginning of each\nfunction. This could be implemented as a LLVM\npass:
\nThis is mostly relevant when the generated code is not optimised. In\noptimised code, local variables often do not need to live on the stack\n(they can be kept in registers).
\nA good high level introduction to the LLVM architecture (LLVM IR and\npasses) can be found in The Architecture of Open Source\nApplications.
\nLLVM uses an intermediate language, LLVM\nIR to optimise and generate native\ncode.
\nFor example, a simple hello world like this,
\n#include <stdio.h>\n\nint main(int argc, char** argv) {\n puts(\"Hello world!\");\n return 0;\n}\n
\nis turned into this LLVM IR:
\n; ModuleID = 'helloworld.c'\ntarget datalayout = \"e-m:e-i64:64-f80:128-n8:16:32:64-S128\"\ntarget triple = \"x86_64-pc-linux-gnu\"\n\n@.str = private unnamed_addr constant [13 x i8] c\"Hello world!\\00\", align 1\n\n; Function Attrs: nounwind uwtable\ndefine i32 @main(i32 %argc, i8** %argv) #0 {\n %1 = alloca i32, align 4\n %2 = alloca i32, align 4\n %3 = alloca i8**, align 8\n store i32 0, i32* %1\n store i32 %argc, i32* %2, align 4\n store i8** %argv, i8*** %3, align 8\n %4 = call i32 @puts(i8* getelementptr inbounds ([13 x i8]* @.str, i32 0, i32 0))\n ret i32 0\n}\n\ndeclare i32 @puts(i8*) #1\n\nattributes #0 = { nounwind uwtable \"less-precise-fpmad\"=\"false\" \"no-frame-pointer-elim\"=\"true\" \"no-frame-pointer-elim-non-leaf\" \"no-infs-fp-math\"=\"false\" \"no-nans-fp-math\"=\"false\" \"stack-protector-buffer-size\"=\"8\" \"unsafe-fp-math\"=\"false\" \"use-soft-float\"=\"false\" }\nattributes #1 = { \"less-precise-fpmad\"=\"false\" \"no-frame-pointer-elim\"=\"true\" \"no-frame-pointer-elim-non-leaf\" \"no-infs-fp-math\"=\"false\" \"no-nans-fp-math\"=\"false\" \"stack-protector-buffer-size\"=\"8\" \"unsafe-fp-math\"=\"false\" \"use-soft-float\"=\"false\" }\n\n!llvm.ident = !{!0}\n\n!0 = metadata !{metadata !\"Debian clang version 3.6.0-svn215195-1 (trunk) (based on LLVM 3.6.0)\"}\n
\nby
\nclang -S -emit-llvm helloworold.c -o helloworld.ll\n
\nThe generated LLVM IR can be target-dependant as the type of the\nvariables may depend on the architecture/OS:
\nint
is mapped into a LLVM i32
on 32-bit, LLP64 and LP64\nsystem but to a i64
on ILP64;long
is mapped into a i32
on 32-bit and LLP64 systems but\nto i64
on LP64 and ILP64.The initial generation of LLVM IR is not done in LLVM but by the\nfrontend (clang, dragonegg, etc.).
\nMany LLVM optimisations are implemented in an architecture independant\nway by IR passes which transform/optimise IR:
\nopt -std-compile-opts -S helloworld.ll -o helloworld.opt.ll --time-passes 2> opt.log\n
\nGenerated IR:
\n; ModuleID = 'helloworld.ll'\ntarget datalayout = \"e-m:e-i64:64-f80:128-n8:16:32:64-S128\"\ntarget triple = \"x86_64-pc-linux-gnu\"\n\n@.str = private unnamed_addr constant [13 x i8] c\"Hello world!\\00\", align 1\n\n; Function Attrs: nounwind uwtable\ndefine i32 @main(i32 %argc, i8** nocapture readnone %argv) #0 {\n %1 = tail call i32 @puts(i8* getelementptr inbounds ([13 x i8]* @.str, i64 0, i64 0)) #2\n ret i32 0\n}\n\n; Function Attrs: nounwind\ndeclare i32 @puts(i8* nocapture readonly) #1\n\nattributes #0 = { nounwind uwtable \"less-precise-fpmad\"=\"false\" \"no-frame-pointer-elim\"=\"true\" \"no-frame-pointer-elim-non-leaf\" \"no-infs-fp-math\"=\"false\" \"no-nans-fp-math\"=\"false\" \"stack-protector-buffer-size\"=\"8\" \"unsafe-fp-math\"=\"false\" \"use-soft-float\"=\"false\" }\nattributes #1 = { nounwind \"less-precise-fpmad\"=\"false\" \"no-frame-pointer-elim\"=\"true\" \"no-frame-pointer-elim-non-leaf\" \"no-infs-fp-math\"=\"false\" \"no-nans-fp-math\"=\"false\" \"stack-protector-buffer-size\"=\"8\" \"unsafe-fp-math\"=\"false\" \"use-soft-float\"=\"false\" }\nattributes #2 = { nounwind }\n\n!llvm.ident = !{!0}\n\n!0 = metadata !{metadata !\"Debian clang version 3.6.0-svn215195-1 (trunk) (based on LLVM 3.6.0)\"}\n
\nThis optimized LLVM IR is then used to generate assembly/binary code\nfor the target architecture:
\nllc helloworld.opt.ll -o helloworld.s --time-passes 2> llc.log\n
\nGenerated assembly:
\n .text\n .file \"/home/foo/temp/helloworld.opt.ll\"\n .globl main\n .align 16, 0x90\n .type main,@function\nmain: # @main\n .cfi_startproc\n# BB#0:\n pushq %rbp\n.Ltmp0:\n .cfi_def_cfa_offset 16\n.Ltmp1:\n .cfi_offset %rbp, -16\n movq %rsp, %rbp\n.Ltmp2:\n .cfi_def_cfa_register %rbp\n movl $.L.str, %edi\n callq puts\n xorl %eax, %eax\n popq %rbp\n retq\n.Ltmp3:\n .size main, .Ltmp3-main\n .cfi_endproc\n\n .type .L.str,@object # @.str\n .section .rodata.str1.1,\"aMS\",@progbits,1\n.L.str:\n .asciz \"Hello world!\"\n .size .L.str, 13\n\n\n .ident \"Debian clang version 3.6.0-svn215195-1 (trunk) (based on LLVM 3.6.0)\"\n .section \".note.GNU-stack\",\"\",@progbits\n
\nA LLVM based compiler uses the following\nphases:
\nSteps 1 and 2 are parts of the code of the compiler. Steps 3 and 4 are\nhandled by the LLVM framework (configurable/pluggable by the\ncompiler).
\nAs we want to touch the content of the stack, we want to add a CodeGen\npass.
\nLet's first try to add a pass to insert a NOP into every function.
\nLet's create a new NoopInserter
pass (NoopInserter.h
). There are\nmany kinds of passes. This pass is a MachineFunction
pass: it is\ncalled (runOnMachineFunction
) on each generated native function\nand can modify it before it is passed to the next pass.
#include <llvm/PassRegistry.h>\n#include <llvm/CodeGen/MachineFunctionPass.h>\n\nnamespace llvm {\n\n class NoopInserter : public llvm::MachineFunctionPass {\n public:\n static char ID;\n NoopInserter();\n virtual bool runOnMachineFunction(llvm::MachineFunction &Fn);\n };\n\n}\n
\nThe ID
is used as a reference to the pass in LLVM: the value of this\nvariable is not important, only its address is used.
#include \"NoopInserter.h\"\n\n#include <llvm/CodeGen/MachineInstrBuilder.h>\n#include <llvm/Target/TargetMachine.h>\n#include <llvm/Target/TargetInstrInfo.h>\n#include <llvm/PassManager.h>\n#include <llvm/Transforms/IPO/PassManagerBuilder.h>\n#include <llvm/CodeGen/Passes.h>\n#include <llvm/Target/TargetSubtargetInfo.h>\n#include \"llvm/Pass.h\"\n\n#define GET_INSTRINFO_ENUM\n#include \"../Target/X86/X86GenInstrInfo.inc\"\n\n#define GET_REGINFO_ENUM\n#include \"../Target/X86/X86GenRegisterInfo.inc.tmp\"\n\nnamespace llvm {\n char NoopInserter::ID = 0;\n\n NoopInserter::NoopInserter() : llvm::MachineFunctionPass(ID) {\n }\n\n bool NoopInserter::runOnMachineFunction(llvm::MachineFunction &fn) {\n const llvm::TargetInstrInfo &TII = *fn.getSubtarget().getInstrInfo();\n MachineBasicBlock& bb = *fn.begin();\n llvm::BuildMI(bb, bb.begin(), llvm::DebugLoc(), TII.get(llvm::X86::NOOP));\n return true;\n }\n\n char& NoopInserterID = NoopInserter::ID;\n}\n\nusing namespace llvm;\n\nINITIALIZE_PASS_BEGIN(NoopInserter, \"noop-inserter\",\n \"Insert a NOOP\", false, false)\nINITIALIZE_PASS_DEPENDENCY(PEI)\nINITIALIZE_PASS_END(NoopInserter, \"noop-inserter\",\n \"Insert a NOOP\", false, false)\n
\nThe runOnMachineFunction
method finds the beginning of the function\nand inserts a X86 NOOP instruction. The method returns true
in order\nto tell the LLVM framework that this function has been modified by\nthis pass. This implementation will only work on X86/AMD64 targets.\nA real pass should be target independent or at least check the target.
The INITIALIZE_PASS
macros declare the pass and declare its\ndependencies. Here, we are declaring a dependency on PEI
a.k.a\nPrologEpilogInserter
which adds the prolog and epilog to the code of\nnative function. Those macros define a function:
void initializeNoopInserterPass(PassRegistry &Registry);\n
\nThe NoopInserterID
may be used by other passes to refer to this\npass.
We have to add a few declarations of this pass.
\nIn include/llvm/CodeGen/Passes.h
:
// NoopInserter - This pass inserts a NOOP instruction\nextern char &NoopInserterID;\n
\nIn include/llvm/InitializePasses.h
:
void initializeNoopInserterPass(PassRegistry &Registry)\n
\nThe pass must be added in llvm::initializeCodeGen()
\nlib/CodeGen/CodeGen.cpp
:
initializeNoopInserterPass(Registry);\n
\nclang -O3 helloworld.c -S -o-\n
\nWe have a nice NOOP:
\n\t.text\n\t.file\t\"/home/foo/temp/helloworld.c\"\n\t.globl\tmain\n\t.align\t16, 0x90\n\t.type\tmain,@function\nmain: # @main\n\t.cfi_startproc\n# BB#0: # %entry\n\tnop\n\tpushq\t%rax\n.Ltmp0:\n\t.cfi_def_cfa_offset 16\n\tmovl\t$.L.str, %edi\n\tcallq\tputs\n\txorl\t%eax, %eax\n\tpopq\t%rdx\n\tretq\n.Ltmp1:\n\t.size\tmain, .Ltmp1-main\n\t.cfi_endproc\n\n\t.type\t.L.str,@object # @.str\n\t.section\t.rodata.str1.1,\"aMS\",@progbits,1\n.L.str:\n\t.asciz\t\"Hello world!\"\n\t.size\t.L.str, 13\n\n\n\t.ident\t\"clang version 3.6.0 \"\n\t.section\t\".note.GNU-stack\",\"\",@progbits\n
\nThe program still works:
\n$ clang -O3 helloworld.c\n$ ./a.out\nHello world!\n
\nI successfully managed to add a pass in order to (actively) do nothing\nin each generated native function. In the next episode, I will try to do\nsomething useful\ninstead.
\n"}, {"id": "http://www.gabriel.urdhr.fr/2014/07/22/same-page-merging/", "title": "Results on same-page-merging snapshots", "url": "https://www.gabriel.urdhr.fr/2014/07/22/same-page-merging/", "date_published": "2014-07-22T00:00:00+02:00", "date_modified": "2014-07-22T00:00:00+02:00", "tags": ["simgrid", "system", "computer", "checkpoint"], "content_html": "In the previous episode, I talked about the\nimplementation of a same-page-merging page store. On top of this, we\ncan build same-page-merging snapshots for the SimGrid model checker.
\nThe next layer on top of the page store, is\na generic logic for saving and restoring a contiguous area of memory\npages:
\n/** @brief Take a per-page snapshot of a region\n *\n * @param data The start of the region (must be at the beginning of a page)\n * @param pag_count Number of pages of the region\n * @param pagemap Linux kernel pagemap values for this region (or NULL)\n * @param reference_pages Snapshot page numbers of the previous mc_soft_dirty_reset() (or NULL)\n * @return Snapshot page numbers of this new snapshot\n */\nmc_mem_region_t region* mc_take_page_snapshot_region(\n void* data, size_t page_count,\n uint64_t* pagemap, size_t* reference_pages);\n\n/** @brief Restore a snapshot of a region\n *\n * If possible, the restoration will be incremental\n * (the modified pages will not be touched).\n *\n* @param start_addr Address of the first page where we have to restore the page\n * @param page_count Number of pages of the region\n * @param pagenos Array of page indices from the global page store\n * @param pagemap Linux kernel pagemap values for this region (or NULL)\n * @param reference_pages Snapshot page numbers of the previous soft_dirty_reset (or NULL)\n */\nvoid mc_restore_page_snapshot_region(\n void* start_ddr, size_t page_count,\n size_t* pagenos,\n uint64_t* pagemap, size_t* reference_pagenos);\n\n/** @brief Free memory of a page store\n */\nvoid mc_free_page_snapshot_region(\n size_t* pagenos, size_t page_count);\n\n/** @brief Reset the soft-dirty bits\n *\n * This is done after checkpointing and after checkpoint restoration\n * (if per page checkpoiting is used) in order to know which pages were\n * modified.\n *\n * See https://www.kernel.org/doc/Documentation/vm/soft-dirty.txt\n * */\nvoid mc_softdirty_reset();\n
\nThe next layer is SimGrid-specific and handles part of the\nsnapshoting logic:
\nmc_softdirty_reset()
\nwhen after takind snapshot or restoring a snapshot;The most invasive part of this modification in the SimGrid codebase is\nthe logic to read data from the snapshots. Without this feature, a\nsimple offset was applied to find the base of a variable in the\nsnapshot: now, a software MMU algorithm must be done. A variable can\nnow be split across different non-contiguous memory pages. The whole\nlogic of reading from snapshots had to me modified to handle this.
\nThose results were obtained with the command:
\n# COMMAND: sendrecv2, mprobe or sendall\n# SPARSE, SOFTDIRTY: yes or no\ncd teshsuite/smpi/mpich3-test/pt2pt/\nexport TIME=\"clock:%e user:%U sys:%S swapped:%W exitval:%x max:%Mk\"\nsetarch x86_64 -R time smpirun -hostfile ../hostfile -platform $(find ../../../.. -name small_platform_with_routers.xml) --cfg=maxmin/precision:1e-9 --cfg=network/model:SMPI --cfg=network/TCP_gamma:4194304 -np 4 --cfg=model-check:1 --cfg=smpi/send_is_detached_thres:0 --cfg=smpi/coll_selector:mpich --cfg=contexts/factory:ucontext --cfg=model-check/max_depth:100000 --cfg=model-check/reduction:none --cfg=model-check/visited:100000 --cfg=contexts/stack_size:4 --cfg=model-check/sparse-checkpoint:$SPARSE --cfg=model-check/soft-dirty:$SOFTDIRTY $COMMAND\n
\nThey were run on a laptop with quad-core Intel\u00ae Core\u2122 i7-3687U\nCPU @ 2.10GHz with 8GiB of RAM. Note that the memory reported is the\nRSS and does not include swapped-out memory.
\nsendrecv2
In this example, we observe an 80% reduction of the memory consumption\nfor a slight slowdown. Using soft-dirty tracking does not have a\npositive impact on the performance: some time is gained in user land\nby avoiding comparing memory pages but the same amount of time is\nspent in kernel space tracking the soft-clean/soft-dirty pages.
\nType | clock | user | system | Max. RSS (KiB)
---|---|---|---|---
Simple snapshot | 9.96s | 9.16s | 0.78s | 3 332 788
Same-page-merging snapshot w/o soft-dirty tracking | 10.02s | 9.82s | 0.19s | 540 420
Same-page-merging snapshot with soft-dirty tracking | 10.70s | 8.86s | 1.80s | 540 936
mprobe
Similar results here:
\nType | clock | user | system | Max. RSS (KiB)
---|---|---|---|---
Simple snapshot | 13.41s | 13.00s | 0.40s | 1 692 492
Same-page-merging snapshot w/o soft-dirty tracking | 14.12s | 13.89s | 0.14s | 414 916
Same-page-merging snapshot with soft-dirty tracking | 14.44s | 13.16s | 1.25s | 415 028
sendflood
In this example, without the same-page-merging snapshot we hit the\nswap limit (the RSS does not include the swapped-out memory). In this\ncase, using same-page-merging snapshots is faster because the process\ndoes not swap. Using soft-dirty tracking does not have a beneficial\nimpact in this case either: a lot of time is lost marking the pages\nas soft-dirty/soft-clean.
\nType | clock | user | system | Max. RSS (KiB)
---|---|---|---|---
Simple snapshot | 73.31s | 56.34s | 5.26s | 7 213 956
Same-page-merging snapshot w/o soft-dirty tracking | 59.12s | 56.87s | 2.22s | 1 570 312
Same-page-merging snapshot with soft-dirty tracking | 82.74s | 53.71s | 29.06s | 1 609 048
This approach achieves an important reduction of the memory\nconsumption without a significant impact on performance. With this\ntechnique we should be able to handle bigger applications and to\nsave more states of the application. Those tests were run on\napplications where a lot of pages change between snapshots. On\napplications where many pages are not modified, the reduction of\nmemory consumption should be much greater.
\nSoft-dirty tracking does not seem to be very efficient in our\ntests. It might be useful if the application is swapping, by avoiding\nswapping when taking a snapshot. This feature will probably be disabled\nby default and might be removed in the future.
\nIt should be possible to increase the efficiency of the method by\nincreasing page sharing:
\nfree()
);It should be possible to speed up the process by:
\nWe used the granularity of the memory page but it is not strictly\nnecessary. We might use a finer granularity in order to increase the\nsharing between snapshots. The granularity (the size of the chunks)\nshould be regular and a power of 2 (in order to be able to apply the\nMMU algorithm). However, the memory overhead would be greater (index\nof the
The first (lower) layer of the per-page snapshot mechanism is a page\nstore: its responsibility is to store immutable shareable\nreference-counted memory pages independently of the snapshotting\nlogic. Snapshot management and representation, as well as soft-dirty\ntracking, will be handled in a higher layer.
\nclass s_mc_pages_store {\n\n typedef uint64_t hash_type;\n typedef boost::unordered_set<size_t> page_set_type;\n typedef boost::unordered_map<hash_type, page_set_type> pages_map_type;\n\n void* memory_;\n size_t capacity_;\n size_t top_index_;\n std::vector<uint64_t> page_counts_;\n std::vector<size_t> free_pages_;\n pages_map_type hash_index_;\n\n // [... Methods]\n\n};\n
\nIn this initial version, the structure of the page store is made of:
\nmemory_: a pointer to a (currently anonymous) mmap()ed memory\nregion holding the memory pages (the address of the first page);
capacity_: the capacity of the page store (in pages). Once all\nthose pages are used, we need to expand the page store with\nmremap();
page_counts_: the reference count of each page. Each time a\nsnapshot references a page, the counter is incremented. If a\nsnapshot is freed, the reference count is decremented. When the\nreference count of a page reaches 0, it is added to a list of\navailable pages (free_pages_);
free_pages_: the list of pages which can be reused. This avoids\nhaving to scan the reference count list to find a free page.\nPages may be added to the free_pages_ list and removed just\nafterwards: the top_index_ field is an index after which all\npages are free and are not in the free_pages_ list;
hash_index_: a hash index mapping the hash of a\npage to the list of page indices with this hash.\nWe use a fast (non-cryptographic) hash so there may be conflicts:\nwe must be able to store multiple indices for the same hash.

We want to keep this memory region (*memory_
) aligned on the memory pages (so\nthat we might be able to create non-linear memory mappings on those\npages in the future) and be able to expand it without copying the\ndata (there will be a lot of pages here): we will be able to\nefficiently expand the memory mapping using mremap()
, moving it\nto another virtual address if necessary.
void* new_memory = mremap(this->memory_, this->capacity_ << xbt_pagebits, newsize << xbt_pagebits, MREMAP_MAYMOVE);\nif (new_memory == MAP_FAILED) {\n xbt_die(\"Could not mremap snapshot pages.\");\n}\nthis->capacity_ = newsize;\nthis->memory_ = new_memory;\nthis->page_counts_.resize(newsize, 0);\n
\nBecause we may move this memory mapping in the virtual address\nspace, we only need to store the index of the page in the snapshots\nand the page will always be looked up by going through memory_
:
const void* s_mc_pages_store::get_page(size_t pageno) const {\n return (char*) this->memory_ + (pageno << pagebits);\n}\n
\nclass s_mc_pages_store {\n // [...]\n\npublic: // Ctor and dtor\n explicit s_mc_pages_store(size_t size);\n ~s_mc_pages_store();\n\npublic: // API\n\n void unref_page(size_t pageno);\n void ref_page(size_t pageno);\n size_t store_page(void* page);\n const void* get_page(size_t pageno) const;\n\nprivate:\n size_t alloc_page();\n\n};\n
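To make the intended semantics of this API concrete, here is a much simplified, self-contained sketch of its behaviour. This is not the SimGrid implementation: it uses a linear scan instead of the hash index and std::vector instead of mmap()ed memory.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

constexpr std::size_t kPageSize = 4096;

// Minimal sketch of the page-store semantics: deduplicated,
// reference-counted pages. Deduplication here is a linear scan over the
// live pages; the real store uses a hash index for this.
class SimplePageStore {
  std::vector<std::vector<char>> pages_;
  std::vector<unsigned> refs_;
  std::vector<std::size_t> free_pages_;

public:
  std::size_t store_page(const void* page) {
    // Reuse an identical page if one is already stored.
    for (std::size_t i = 0; i < pages_.size(); ++i)
      if (refs_[i] > 0 &&
          std::memcmp(pages_[i].data(), page, kPageSize) == 0) {
        ++refs_[i];
        return i;
      }
    // Otherwise allocate a slot (recycling a free one if possible).
    std::size_t pageno;
    if (!free_pages_.empty()) {
      pageno = free_pages_.back();
      free_pages_.pop_back();
    } else {
      pageno = pages_.size();
      pages_.emplace_back(kPageSize);
      refs_.push_back(0);
    }
    std::memcpy(pages_[pageno].data(), page, kPageSize);
    ++refs_[pageno];
    return pageno;
  }
  void ref_page(std::size_t pageno) { ++refs_[pageno]; }
  void unref_page(std::size_t pageno) {
    if (--refs_[pageno] == 0) free_pages_.push_back(pageno);
  }
  const void* get_page(std::size_t pageno) const {
    return pages_[pageno].data();
  }
};
```

Storing the same content twice yields the same index with a reference count of two; unreferencing a page down to zero puts its slot on the free list for reuse.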
\nget_page()
get_page()
returns a pointer to the memory of a page from its index.
const void* s_mc_pages_store::get_page(size_t pageno) const {\n return (char*) this->memory_ + (pageno << pagebits);\n}\n
\nstore_page()
store_page()
is used to store a page in the page store and return\nthe index of the stored page.
size_t s_mc_pages_store::store_page(void* page)\n{\n
\nFirst, we check if a page with the same content is already in the page\nstore:
\nwe compute the hash of the page and look up, in hash_index_, the set of stored pages with the same hash;\nwe memcmp() those pages with the one we are inserting to find a page with the same content. uint64_t hash = mc_hash_page(page);\n page_set_type& page_set = this->hash_index_[hash];\n BOOST_FOREACH (size_t pageno, page_set) {\n const void* snapshot_page = this->get_page(pageno);\n if (memcmp(page, snapshot_page, xbt_pagesize) == 0) {\n
\nIf a page with the same content is already in the page store it is\nreused and its reference count is incremented.
\n page_counts_[pageno]++;\n return pageno;\n }\n }\n
\nOtherwise, a new page is allocated in the page store and the content\nof the page is memcpy()
-ed to this new page.
size_t pageno = this->alloc_page();\n void* snapshot_page = (void*) this->get_page(pageno);\n memcpy(snapshot_page, page, xbt_pagesize);\n page_set.insert(pageno);\n page_counts_[pageno]++;\n return pageno;\n}\n
\nref_page()
This method is used to increase the reference count of a page when we know\nthat the content of the page is the same as that of a page already in the page\nstore.
\nThis will be the case if a page is soft-clean: we know that it has not\nchanged since the previous snapshot/restoration and we can avoid\nhashing the page and comparing it byte-per-byte to candidates.
\nvoid s_mc_pages_store::ref_page(size_t pageno) {\n ++this->page_counts_[pageno];\n}\n
\nunref_page()
Decrement the reference count of this page. Used when a snapshot is\ndestroyed.
\nIf the reference count reaches zero, the page is recycled: it is added\nto the free_pages_
list and removed from the hash_index_
. In the\ncurrent implementation, we need to hash the page in order to find it\nin the index.
void s_mc_pages_store::unref_page(size_t pageno) {\n if ((--this->page_counts_[pageno]) == 0) {\n this->free_pages_.push_back(pageno);\n void* page = ((char*)this->memory_ + (pageno << pagebits));\n uint64_t hash = mc_hash_page(page);\n this->hash_index_[hash].erase(pageno);\n }\n}\n
\nCurrently the code is using djb2\nbut other hashes such as\nMurmur or\nCityHash are probably\nbetter.
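For reference, djb2 reduces to a few lines. This sketch hashes a buffer in the spirit of mc_hash_page(); the actual implementation may differ:

```cpp
#include <cstddef>
#include <cstdint>

// djb2 (Bernstein): hash = hash * 33 + byte, starting from 5381.
// Fast and non-cryptographic, so collisions must be tolerated by the index.
std::uint64_t djb2_hash(const unsigned char* data, std::size_t size) {
  std::uint64_t hash = 5381;
  for (std::size_t i = 0; i < size; ++i)
    hash = ((hash << 5) + hash) + data[i];  // hash * 33 + data[i]
  return hash;
}
```

Because the hash only has to bucket candidate pages before the memcmp(), speed matters more than distribution quality here, which is why faster hashes like Murmur or CityHash are attractive replacements.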
\nIt is very easy to use a file (shared memory, FS file, block-device)\ninstead of anonymous memory for the page store: could we use this to\nparallelise the model checker on different processes or even machines?
\nI looked at my options to achieve efficient/cheap snapshots of the\nsimulated application for the SimGrid model checker using\ncopy-on-write. Here I look at another\nsolution to achieve this without using copy-on-write.
\nThe idea is to save each page of the state of the application\nindependently: when a snapshot page is stored, the snapshotting logic\nfirst checks if a page with the same content is already stored\nin the snapshot pages: if this is the case, the stored page is reused;\notherwise, the content of the page is copied into the snapshot pages.
\nThe memory pages are only shared between the different snapshots but\nare never shared with the simulated application: copy-on-write is not\nused, which means that the simulated application will not be slowed\ndown by the unsharing page faults. As a result, the basic solution can\nbe implemented purely in userspace.
\nThe first snapshot will be a full snapshot. Other snapshots will\nusually be shallow: if 98% of the memory pages are not touched between\nsuccessive snapshots, all those pages will be shared and only 2%\nof the pages will be copied in the second snapshot.
\nA hash of the content of the page can be used to limit the\ncomparison of the new memory page with only a subset of the stored\nmemory pages.
\nIt is still necessary to scan and hash all the pages of the\nstate of the simulated process each time a snapshot is done\nwhich seems to be quite inefficient.\nWe can use the\nsoft-dirty\nfeature of the Linux kernel to detect which pages have been written\nsince the previous snapshot and only try to store the modified\nones.
\nAfter each snapshot, each page of the process is marked as soft-clean\nand protected against write. Each time a soft-clean page is touched, a\npage fault is raised: the kernel marks the page as soft-dirty and\nremoves the protection on the page. On the next snapshot, it is\npossible to find which pages are soft-dirty (i.e. were modified since\nthe previous snapshot) and only save those pages.
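Concretely, Linux exposes this state through /proc/&lt;pid&gt;/pagemap (one 64-bit entry per virtual page, with the soft-dirty flag in bit 55) and the bits are cleared by writing 4 to /proc/&lt;pid&gt;/clear_refs, as described in the kernel documentation linked above. A sketch of the entry decoding (the file-reading plumbing is omitted):

```cpp
#include <cstdint>

// /proc/<pid>/pagemap layout (Documentation/vm/pagemap.txt):
// one 64-bit entry per virtual page; bit 55 = soft-dirty, bit 63 = present.
// The entry for virtual page N sits at byte offset N * 8 in the file.
constexpr std::uint64_t kSoftDirty = std::uint64_t(1) << 55;
constexpr std::uint64_t kPresent = std::uint64_t(1) << 63;

inline bool is_soft_dirty(std::uint64_t pagemap_entry) {
  return (pagemap_entry & kSoftDirty) != 0;
}
```

At snapshot time, only the pages whose entry has this bit set need to be pushed through the page store; the others can simply re-reference the page kept by the previous snapshot.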
\nEven when restoring the state of a snapshot, we might use the\nsoft-dirty information to avoid copying data which has not changed:
\nIf a lot of pages do not change between snapshots, this technique\nreduces the number of pages which need to be copied to restore\na snapshot (and avoids the related soft-dirty page faults).
\nOnce an efficient snapshotting strategy is implemented, I expect that\nin many cases, most of the time will be spent in the state comparison\ncode: we need to find a solution to avoid spending too much time\ntranslating between the addresses of the simulated application and the\naddresses of the snapshots.
\nWe might want to create a linear view of the memory areas of each\nsnapshot in order to have a simple code for the address\ntranslation:
\nfind_snapshot_address(real_address, snapshot)\n{\n memory_area \u2190 find_memory_area(real_address)\n offset \u2190 real_address - memory_area.start\n snapshot_area \u2190 find_snapshot_area(memory_area)\n snapshot_address \u2190 snapshot_area.start + offset\n return snapshot_address\n}\n
\nMoreover, and probably more importantly, as long as we are in the same\nmemory region, an offset applied to the real address translates\ninto the same offset in the linear view of the snapshot. This case\nhappens all the time when we are comparing the states:
\nIn all those cases, we could apply a simple offset from the base\nsnapshot address: if the memory\npages of the snapshot are scattered in the virtual memory space, the\nmodel checker will have to\napply the offset to the real base address\nand then translate the resulting address.
\nOne solution to create a linear view of a snapshot memory region\nwould be to use a non-linear memory mapping (remap_file_pages
) of\nthe snapshot memory:
A memory-backed (tmpfs) file is used as a backend for the snapshot\npages;\nremap_file_pages()
can be used to create a linear view of a given\nsnapshot memory.However, one remap_file_pages()
call will be necessary per\nmemory page so I do not expect this solution to be very promising\nunless a more efficient version of this system call is added in a\nlater release of the Linux kernel.
Update: remap_file_pages() is\ndeprecated.
\nAnother solution is to create a linear copy of the snapshot areas.\nWe incrementally update those copies to reflect different snapshots,\nonly updating the pages which are different between the different\nsnapshots.
\nWhen we want to compare the current state against another one, we\nfirst have to recreate a linear view of the snapshot of the latter by\ncopying in the linear view all the memory pages which are different\nfrom the previous view.
\nWe want to avoid reconstructing the state memory. This can be done by\ncreating a global hash of the state of the simulated application\nbased on key characteristics of the state (such\nas the number of processes, the instruction pointers of each process\nin its stack frame, etc.).
\nThe other solution is to replicate the algorithm of the MMU in\nsoftware to translate from virtual pages into file pages:
\nfind_snapshot_address(real_address, snapshot)\n{\n page_number \u2190 get_page_number(address)\n offset \u2190 get_offset(address)\n snapshot_page_number \u2190 get_snapshot_page_number(snapshot, page_number)\n snapshot_page_address \u2190 get_page_address(snapshot_page_number)\n snapshot_address \u2190 snapshot_page_address + offset\n return snapshot_address\n}\n
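In C++ the same translation might look like this. It is a sketch under assumed names: the page size, structures and helpers are illustrative, not SimGrid's:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kPageBits = 12;  // assumed 4 KiB pages
constexpr std::size_t kPageSize = std::size_t(1) << kPageBits;

// Stored pages, contiguous in the page store's memory.
struct PageStore {
  std::vector<char> memory;
  const void* get_page(std::size_t pageno) const {
    return memory.data() + (pageno << kPageBits);
  }
};

// One snapshotted region: page-aligned base address and, for each of its
// virtual pages, the index of the corresponding page in the store.
struct SnapshotRegion {
  std::uintptr_t start_addr;
  std::vector<std::size_t> pagenos;
};

const void* find_snapshot_address(std::uintptr_t real_address,
                                  const SnapshotRegion& region,
                                  const PageStore& store) {
  std::size_t page_number = (real_address - region.start_addr) >> kPageBits;
  std::size_t offset = real_address & (kPageSize - 1);
  std::size_t snapshot_pageno = region.pagenos.at(page_number);
  return static_cast<const char*>(store.get_page(snapshot_pageno)) + offset;
}
```

The translation itself is a shift, a mask, one array lookup and an addition, which is why the per-access cost, while not zero, stays small and O(1).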
\nAs I said earlier, this might impact the performance of the state\ncomparison.
\nWe might use another granularity instead of the page:\nfor example we might snapshot at the malloc()
\ngranularity:
This approach seems quite promising:
\nIt is not clear which variation will be the most efficient.\nI am probably going to implement the software MMU approach.
\n"}, {"id": "http://www.gabriel.urdhr.fr/2014/06/02/cow-snapshots/", "title": "Copy-on-write snapshots for the SimGrid model checker", "url": "https://www.gabriel.urdhr.fr/2014/06/02/cow-snapshots/", "date_published": "2014-06-02T00:00:00+02:00", "date_modified": "2014-06-02T00:00:00+02:00", "tags": ["simgrid", "system", "computer", "checkpoint"], "content_html": "The SimGrid model checker\nexplores the graph of possible executions of\na simulated distributed application in order to verify safety and\nliveness properties. The model checker needs to store the state of the\napplication in each node of the execution graph in order to detect\ncycles. However, saving the whole state of the application at each\nnode of the graph leads to huge memory consumption and in some\ncases most of the time is spent copying data in order to take the\nsnapshots of the application. We will see how we could solve this problem,\nusing copy-on-write.
\nSimGrid simulates a distributed application on a single\nmachine in a single OS process: this allows very efficient\ntask switching as it can be done completely in\nuserspace. All simulated processes use a shared heap and\neach one uses its own stack which is allocated on this shared heap.
\nThe model checker lives in the same OS process and uses a separate\nheap. Each time it needs to take a snapshot of the application, the\nmodel checker makes a copy (using memcpy()
) of each memory area which\nis considered to contain a part of the state of the application:
.data
section\ncontaining the global variables of the application);Saving a lot of snapshots of the application can use a lot of memory. Some of\nthe applications we are trying to model-check use the whole 256 GiB of\nRAM of the machines we are using. Moreover in some applications, most\nof the time is spent copying the data (the diagram is made with\nFlameGraph):
\n\nsmpirun -wrapper \"perf record -g -e cycles\" -hostfile hostfile -platform msg_platform.xml -np 4 --cfg=model-check:1 --cfg=model-check/reduction:none --cfg=model-check/communications_determinism:1 --cfg=smpi/send_is_detached_thres:0 --cfg=model-check/max_depth:100000 --cfg=smpi/running_power:1e9 --cfg=contexts/factory:ucontext --cfg=model-check/visited:100 ./sp.S.4\nperf script | ~/src/FlameGraph/stackcollapse-perf.pl | grep -v '^\\[unknown\\];' | ~/src/FlameGraph/flamegraph.pl > sp.S.4.svg\n
\nIn practice, in many applications, only a small part of the memory of the\napplication has changed between successive states. In order to\nevaluate this, I modified the model-checker to use the Linux\nsoft-dirty\nmechanism:
\nOn the previous benchmark\n(sp.S.4
from the NAS Parallel Benchmarks Version 3.3),\n99% of the memory\npages of the state of the application\nwere not touched between successive snapshots: at least 99% of\nthe memory could be shared between successive snapshots\nwhen analysing this application.
Based on this observation, we would like to find a smarter way\nto take snapshots of the application with the following goals in mind:
\nThe KSM (Kernel Samepage Merging)\nmechanism of the Linux kernel can be\nused to enable automatic page sharing between snapshots:\nthe kernel finds memory pages with the same content, merges them\nand uses copy-on-write to unshare them if one of the virtual page is\nmodified later on.
\nIn order to do this, the application must mark each memory region\nwhere it wants the kernel to detect mergeable pages:
\nmadvise(start, length, MADV_MERGEABLE);\n
\nKSM must be enabled system-wide (as root) with:
\n# Enable KSM:\necho 1 > /sys/kernel/mm/ksm/run\n# Scan more pages:\necho 10000 > /sys/kernel/mm/ksm/pages_to_scan\n
\nThis solution is quite nonintrusive and has been implemented.
\nHowever, it does not address our second goal (avoid copying the data):\nthe page must be completely copied and only then will the KSM kernel process\nscan it and merge it back. Moreover, scanning the pages in order to\nfind duplicates is quite CPU intensive. As a result (this part needs\nto be verified), the memory pages are deduplicated more slowly than they\nare allocated, which means that the memory reduction is very limited in\npractice.
\nThis leads to the idea of doing explicit copy-on-write instead.
\nCopy-on-write is used on most POSIXish systems by the fork()
\nfunction. In the case of a single-threaded application, a forked\nprocess could be seen as a snapshot of the simulated application.\nHowever, the snapshot memory does not live in the same virtual address\nspace and is not easily available to the model checker without copying\nit back into the main process.
mprotect()
A possible solution to implement copy-on-write would be to implement\nit in userspace using mprotect()
\nand remap_file_pages()
:
mprotect()
is used to protect the pages against writes;A memory-backed (tmpfs) file is used as an intermediate level between logical\npages and physical pages:\nphysical memory \u2192 file memory \u2192 virtual memory or swap.\nThe remap_file_pages()
Linux system call can be used to create a\nnon-linear mapping between physical pages and file pages.
However, this does not seem to be a suitable solution:
\nmprotect()
is not designed to be used on a page granularity on\nLinux but to be used on the vm_area_struct
which is the\nstructure used in the kernel to represent a homogeneous memory\nmapping (they are seen in /proc/$pid/maps
). Using mprotect()
\non a part of a vm_area_struct
will split it in parts: using\nmprotect()
at the page granularity will generate a lot of\nvm_area_struct
splits and merges and in general a lot of\nvm_area_struct
for the process. Having a lot of vm_area_struct
\nhas a bad impact on the performance of the application.remap_file_pages()
could be used to generate a non-linear\nmapping between physical pages and file pages but it does not\nseem to be designed to work at the page granularity either: one\nsystem call per memory page must be used in this case.Update: remap_file_pages() is\ndeprecated.
\nSome operating systems expose an in-process copy-on-write\nfunctionality. Some Mach-based systems expose it using the\nvm_remap()
\nMach call.\nHowever, the only 64-bit OS supporting this seems to be\nXNU/Darwin/MacOS X/IOS:\nporting the model checker to XNU systems would\ntake a lot of time (and it seems Darwin without MacOS X is quite dead\nanyway). It sounded like a good excuse to try the Hurd\nwhich is based on a Mach kernel\nbut I discovered that it does not work with 64-bit systems.
Linux does not expose an in-process copy-on-write functionality. I\ncould try to add this feature to the Linux kernel: the copy-on-write\nlogic would not be touched, the only missing bit is code to set up the\ncopy-on-write regions properly inside the same process and an interface\n(syscall option, etc.) to trigger it from userspace.\nI am not sure our chances of merging this feature would be very high,\nbut this might be a solution worth exploring in the future.
\nA native copy-on-write solution should address all of our goals. Page\nfaults with page deduplication will slow the application down: in\npractice if a small number of pages are modified between different\nsnapshots this should not be a big issue and I expect that it would\nstill be a big win compared to the current implementation.
\nIn the next episode, we will have a\nlook at non copy-on-write solutions based on userspace-managed page-level\nsnapshots.
\nFlameGraph\nis a tool which generates SVG graphics\nto visualise stack-sampling-based\nprofiles. It processes data collected with tools such as Linux perf,\nSystemTap or DTrace.
\nFor the impatient:
\nThe idea is that in order to know where your application is using CPU\ntime, you should sample its stack. You can get one sample of the\nstack(s) of a process with GDB:
\n# Sample the stack of the main (first) thread of a process:\ngdb -ex \"set pagination 0\" -ex \"bt\" -batch -p $(pidof okular)\n\n# Sample the stack of all threads of the process:\ngdb -ex \"set pagination 0\" -ex \"thread apply all bt\" -batch -p $(pidof okular)\n
\nThis generates backtraces such as:
\n\n[...]\nThread 2 (Thread 0x7f4d7bd56700 (LWP 15156)):\n#0 0x00007f4d9678b90d in poll () from /lib/x86_64-linux-gnu/libc.so.6\n#1 0x00007f4d93374fe4 in g_main_context_poll (priority=2147483647, n_fds=2, fds=0x7f4d70002e70, timeout=-1, context=0x7f4d700009a0) at /tmp/buildd/glib2.0-2.40.0/./glib/gmain.c:4028\n#2 g_main_context_iterate (context=context@entry=0x7f4d700009a0, block=block@entry=1, dispatch=dispatch@entry=1, self=\n) at /tmp/buildd/glib2.0-2.40.0/./glib/gmain.c:3729\n#3 0x00007f4d933750ec in g_main_context_iteration (context=0x7f4d700009a0, may_block=1) at /tmp/buildd/glib2.0-2.40.0/./glib/gmain.c:3795\n#4 0x00007f4d9718b676 in QEventDispatcherGlib::processEvents(QFlags<:processeventsflag>) () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4\n#5 0x00007f4d9715cfef in QEventLoop::processEvents(QFlags<:processeventsflag>) () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4\n#6 0x00007f4d9715d2e5 in QEventLoop::exec(QFlags<:processeventsflag>) () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4\n#7 0x00007f4d97059bef in QThread::exec() () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4\n#8 0x00007f4d9713e763 in ?? () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4\n#9 0x00007f4d9705c2bf in ?? 
() from /usr/lib/x86_64-linux-gnu/libQtCore.so.4\n#10 0x00007f4d93855062 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0\n#11 0x00007f4d96796c1d in clone () from /lib/x86_64-linux-gnu/libc.so.6\n\nThread 1 (Thread 0x7f4d997ab780 (LWP 15150)):\n#0 0x00007f4d9678b90d in poll () from /lib/x86_64-linux-gnu/libc.so.6\n#1 0x00007f4d93374fe4 in g_main_context_poll (priority=2147483647, n_fds=8, fds=0x2f8a940, timeout=1998, context=0x1c747e0) at /tmp/buildd/glib2.0-2.40.0/./glib/gmain.c:4028\n#2 g_main_context_iterate (context=context@entry=0x1c747e0, block=block@entry=1, dispatch=dispatch@entry=1, self= ) at /tmp/buildd/glib2.0-2.40.0/./glib/gmain.c:3729\n#3 0x00007f4d933750ec in g_main_context_iteration (context=0x1c747e0, may_block=1) at /tmp/buildd/glib2.0-2.40.0/./glib/gmain.c:3795\n#4 0x00007f4d9718b655 in QEventDispatcherGlib::processEvents(QFlags<:processeventsflag>) () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4\n#5 0x00007f4d97c017c6 in ?? () from /usr/lib/x86_64-linux-gnu/libQtGui.so.4\n#6 0x00007f4d9715cfef in QEventLoop::processEvents(QFlags<:processeventsflag>) () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4\n#7 0x00007f4d9715d2e5 in QEventLoop::exec(QFlags<:processeventsflag>) () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4\n#8 0x00007f4d97162ab9 in QCoreApplication::exec() () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4\n#9 0x00000000004082d6 in ?? ()\n#10 0x00007f4d966d2b45 in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6\n#11 0x0000000000409181 in _start ()\n[...]\n
By doing this a few times, you should be able to have an idea of\nwhat is taking time in your process (or thread).
\nTaking a few random stack samples of the process might be fine and\nhelp you in some cases but in order to have more accurate information,\nyou might want to take a lot of stack samples. FlameGraph can help you\nvisualize those stack samples.
\nFlameGraph reads a file from the standard input representing stack\nsamples in a simple format where each line represents a type of stack\nand the number of samples:
\n\nmain;init;init_boson_processor;malloc 2\nmain;init;init_logging;malloc 4\nmain;processing;compute_value 8\nmain;cleanup;free 3\n\n
FlameGraph generates a corresponding SVG representation:
\n\nFlameGraph ships with a set of preprocessing scripts\n(stackcollapse-*.pl
) used to convert data from various\nperformance/profiling tools into this simple format\nwhich means you can use FlameGraph with perf, DTrace,\nSystemTap or your own tool:
your_tool | flamegraph_preprocessor_for_your_tool | flamegraph > result.svg\n
\nIt is very easy to add support for a new tool in a few lines of\nscripts. I wrote a\npreprocessor\nfor the GDB backtrace
output (produced by the previous poor man's\nprofiler script) which is now available\nin the main repository.
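The core of any such preprocessor is tiny: join the frames of each sample with ';' and count identical stacks. A sketch of that aggregation step (illustrative, not one of the shipped scripts):

```cpp
#include <map>
#include <string>
#include <vector>

// Collapse a list of stack samples (each sample = frames from outermost to
// innermost) into the "frame;frame;frame count" lines FlameGraph reads.
std::map<std::string, int>
collapse(const std::vector<std::vector<std::string>>& samples) {
  std::map<std::string, int> counts;
  for (const auto& sample : samples) {
    std::string key;
    for (const auto& frame : sample) {
      if (!key.empty()) key += ';';
      key += frame;
    }
    ++counts[key];  // one more sample observed for this exact stack
  }
  return counts;
}
```

A real preprocessor additionally has to parse its tool's output into frames; everything after that is this counting loop plus printing each map entry as "key count".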
As FlameGraph uses a tool-neutral line-oriented format, it is very\neasy to add generic filters after the preprocessor (using sed
,\ngrep
, etc.):
the_tool | flamegraph_preprocessor_for_the_tool | filters | flamegraph > result.svg\n
\nUpdate 2015-08-22:\nElfutils ships a stack
program\n(called eu-stack
on Debian) which seems to be much faster than GDB\nfor use as a poor person's profiler in a shell script. I wrote a\nscript in order to feed its output to\nFlameGraph.
perf is a very powerful tool for Linux to do performance analysis of\nprograms. For example, here is how we can generate\nan on-CPU\nFlameGraph of an application using perf:
\n# Use perf to do a time based sampling of an application (on-CPU):\nperf record -F99 --call-graph dwarf myapp\n\n# Turn the data into a cute SVG:\nperf script | stackcollapse-perf.pl | flamegraph.pl > myapp.svg\n
\nThis samples the on-CPU time, excluding time when the process is not\nscheduled (idle, waiting on a semaphore, etc.), which may not be what you\nwant. It is possible to sample\noff-CPU\ntime as well with\nperf.
\nThe simple and fast solution[1] is to use the frame pointer\nto unwind the stack frames (--call-graph fp
). However, the frame pointer\ntends to be omitted these days (it is not mandated by the x86_64 ABI):\nit might not work very well unless you recompile code and dependencies\nwithout omitting the frame pointer (-fno-omit-frame-pointer
).
Another solution is to use CFI to unwind the stack (with --call-graph dwarf
): this uses either the DWARF CFI (.debug_frame
section) or\nruntime stack unwinding (.eh_frame
section). The CFI must be present\nin the application and shared-objects (with\n-fasynchronous-unwind-tables
or -g
). On x86_64, .eh_frame
should\nbe enabled by default.
Update 2015-09-19: Another solution on recent Intel chips (and\nrecent kernels) is to use the hardware LBR (Last Branch Record)\nregisters (with --call-graph lbr
).
As FlameGraph uses a simple line oriented format, it is very easy to\nfilter/transform the data by placing a filter between the\nstackcollapse
preprocessor and FlameGraph:
# I am only interested in what is happening in MAIN():\nperf script | stackcollapse-perf.pl | grep MAIN | flamegraph.pl > MAIN.svg\n\n# I am not interested in what is happening in init():\nperf script | stackcollapse-perf.pl | grep -v init | flamegraph.pl > noinit.svg\n\n# Let's pretend that realloc() is the same thing as malloc():\nperf script | stackcollapse-perf.pl | sed s/realloc/malloc/ | flamegraph.pl > alloc.svg\n
\nIf you have recursive calls you might want to merge them in order to\nhave a more readable view. This is implemented in my\nbranch\nby stackfilter-recursive.pl
:
# I want to merge recursive calls:\nperf script | stackcollapse-perf.pl | stackfilter-recursive.pl | grep MAIN | flamegraph.pl\n
\nUpdate 2015-10-16: this has been merged upstream.
\nSometimes you might not be able to get relevant information with\nperf
. This might be because you do not have debugging symbols for\nsome libraries you are using: you will end up with missing\ninformation in the stacktrace. In this case, you might want to use GDB\ninstead using the poor man's profiler\nmethod because it tends to be better at unwinding the stack without\nframe pointer and debugging information:
# Sample an already running process:\npmp 500 0.1 $(pidof mycommand) > mycommand.gdb\n\n# Or:\nmycommand my_arguments &\npmp 500 0.1 $!\n\n# Generate the SVG:\ncat mycommand.gdb | stackcollapse-gdb.pl | flamegraph.pl > mycommand.svg\n
\nWhere pmp
is a poor man's profiler script such as:
#!/bin/bash\n# pmp - \"Poor man's profiler\" - Inspired by http://poormansprofiler.org/\n# See also: http://dom.as/tag/gdb/\n\nnsamples=$1\nsleeptime=$2\npid=$3\n\n# Sample stack traces:\nfor x in $(seq 1 $nsamples); do\n gdb -ex \"set pagination 0\" -ex \"thread apply all bt\" -batch -p $pid 2> /dev/null\n sleep $sleeptime\ndone\n
\nUsing this technique will slow the application a lot.
\nCompared to the example with perf, this approach samples both on-CPU\nand off-CPU time.
\nHere are some figures obtained when I was optimising the\nSimgrid\nmodel checker\non a given application\nusing the poor man's profiler to sample the stack.
\nHere is the original profile before optimisation:
\n\n82% of the time is spent in get_type_description()
. In fact, the\nmodel checker spends its time looking up type descriptions in some hash tables\nover and over again.
Let's fix this and store a pointer to the type description instead of\na type identifier in order to avoid looking up those types over\nand over again:
\n\nAfter this modification,\n32% of the time is spent in libunwind get_proc_name()
(looking up\nfunction names from given values of the instruction pointer) and\n13% is spent reading and parsing the output of cat /proc/self/maps
\nover and over again (in xbt_getline()
). Let's fix the second issue first\nbecause it is simple: we can cache the memory mapping of the process in\norder to avoid parsing /proc/self/maps
all the time.
Now, let's fix the other issue by resolving the functions\nourselves. It turns out we already had the address range of each function\nin memory (parsed from DWARF information). All we have to do is use a\nbinary search in order to have a nice O(log n) lookup[2].
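That lookup can be sketched as a binary search over the functions sorted by start address (illustrative structures, not SimGrid's actual ones):

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// A function's address range, as parsed from DWARF: [low_pc, high_pc).
struct FunctionRange {
  std::uintptr_t low_pc, high_pc;
  std::string name;
};

// Resolve an instruction pointer in O(log n): find the last function whose
// low_pc is <= ip, then check that ip falls before its high_pc.
// `functions` must be sorted by low_pc.
const FunctionRange* resolve_function(
    const std::vector<FunctionRange>& functions, std::uintptr_t ip) {
  auto it = std::upper_bound(
      functions.begin(), functions.end(), ip,
      [](std::uintptr_t addr, const FunctionRange& f) {
        return addr < f.low_pc;
      });
  if (it == functions.begin()) return nullptr;  // ip before every function
  --it;
  return ip < it->high_pc ? &*it : nullptr;     // ip inside this range?
}
```

Sorting the ranges once at load time is enough; every subsequent resolution is then a handful of comparisons instead of a hash lookup or a /proc parse.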
\n\nStill 17% of the time is spent looking up type descriptions from type\nidentifiers in a hash table. Let's store the reference to the type\ndescriptions and avoid this:
\n\nThe non-optimised version was taking 2 minutes to complete. With\nthose optimisations, it takes only 6 seconds \ud83d\ude2e. There is\nstill room for optimisation here as 30% of the time is now spent in\nmalloc()
/free()
managing heap information.
Perf can sample many other kinds of events (hardware performance\ncounters, software performance counters, tracepoints, etc.). You can get\nthe list of available events with perf list
. If you run it as\nroot you will have a lot more events (all the kernel tracepoints).
Here are some interesting events:
\ncache-misses
are in general last-level cache misses (the\ndata is not in any cache and must be fetched from RAM which\nis much slower).page-faults
.More information about some perf events can be found in\nperf_event_open(2)
.
You can then sample an event with:
\nperf record --call-graph dwarf -e cache-misses myapp\n
\n\nC++ function names are mangled (e.g. _ZTSSt9bad_alloc@@GLIBCXX_3.4
),\nc++filt
can be used after the stackcollapse
script to demangle them.--reverse
flag of flamegraph.pl
.perf
./proc/$pid/stack
.If you liked this post, you might as well like:
\n\nWhen using frame pointer unwinding, the kernel unwinds the stack\nitself and only gives the instruction pointer of each frame to\nperf record
. This behaviour is triggered by the\nPERF_SAMPLE_CALLCHAIN
sample type.
When using DWARF unwinding, the kernel takes a snapshot of (a\npart of) the stack and gives it to perf record
: perf record
\nstores it in a file and the DWARF unwinding is done afterwards by\nthe perf tools. This uses\nPERF_SAMPLE_STACK_USER
. PERF_SAMPLE_CALLCHAIN
is used as well\nbut for the kernel-side stack (exclude_callchain_user
). \u21a9\ufe0e
Cache friendliness could probably be better however.\nSee for example\nCache-friendly binary search. \u21a9\ufe0e
\n