{"version": "https://jsonfeed.org/version/1", "title": "/dev/posts/ - Tag index - simgrid", "home_page_url": "https://www.gabriel.urdhr.fr", "feed_url": "/tags/simgrid/feed.json", "items": [{"id": "http://www.gabriel.urdhr.fr/2016/08/01/simgrid-synchronisation/", "title": "C++ synchronisations for SimGrid", "url": "https://www.gabriel.urdhr.fr/2016/08/01/simgrid-synchronisation/", "date_published": "2016-08-01T00:00:00+02:00", "date_modified": "2016-08-01T00:00:00+02:00", "tags": ["computer", "simgrid", "c++", "future"], "content_html": "

This is an overview of some recent additions to the SimGrid code\nrelated to actor synchronisation. It might be interesting for people\nusing SimGrid, working on SimGrid or for people interested in generic\nC++ code for synchronisation or asynchronicity.

\n

Table of Contents

\n
\n\n
\n

SimGrid as a Discrete Event Simulator

\n

SimGrid is a discrete event simulator of\ndistributed systems: it does not simulate the world by small fixed-size steps\nbut determines the date of the next event (such as the end of a communication,\nthe end of a computation) and jumps to this date.

\n

A number of actors executing user-provided code run on top of the\nsimulation kernel1. When an actor needs to interact with the simulation\nkernel (eg. to start a communication), it issues a simcall\n(simulation call, an analogy to system calls) to the simulation kernel.\nThis freezes the actor until it is woken up by the simulation kernel\n(eg. when the communication is finished).

\n

The key ideas here are:

\n\n

Futures

\n

What is a future?

\n

We need a generic way to represent asynchronous operations in the\nsimulation kernel. Futures\nare a nice abstraction for this which has been added to many languages\n(Java, Python, C++ since C++11, ECMAScript, etc.)9.

\n

A future represents the result of an asynchronous operation. As the operation\nmay not be completed yet, its result is not available yet. Two different sorts\nof APIs may be available to expose this future result:

\n\n

C++11 includes a generic class (std::future<T>) which implements a blocking API.\nThe continuation-based API\nis not available in the standard (yet) but is described in the\nConcurrency Technical\nSpecification.

\n

Which future do we need?

\n

We might want to use a solution based on std::future but our need is slightly\ndifferent from the C++11 futures: C++11 futures are not suitable for use inside\nthe simulation kernel because they only provide a blocking API\n(future.get()) whereas the simulation kernel cannot block.\nInstead, we need a continuation-based API to be used in our event-driven\nsimulation kernel.

\n

The C++ Concurrency TS describes a continuation-based API.\nOur futures are based on this with a few differences5:

\n\n

Implementing Future

\n

The implementation of the future is in simgrid::kernel::Future and\nsimgrid::kernel::Promise6 and is based on the Concurrency\nTS3:

\n

The future and the associated promise use a shared state defined with:

\n
enum class FutureStatus {\n  not_ready,\n  ready,\n  done,\n};\n\nclass FutureStateBase : private boost::noncopyable {\npublic:\n  void schedule(simgrid::xbt::Task<void()>&& job);\n  void set_exception(std::exception_ptr exception);\n  void set_continuation(simgrid::xbt::Task<void()>&& continuation);\n  FutureStatus get_status() const;\n  bool is_ready() const;\n  // [...]\nprivate:\n  FutureStatus status_ = FutureStatus::not_ready;\n  std::exception_ptr exception_;\n  simgrid::xbt::Task<void()> continuation_;\n};\n\ntemplate<class T>\nclass FutureState : public FutureStateBase {\npublic:\n  void set_value(T value);\n  T get();\nprivate:\n  boost::optional<T> value_;\n};\n\ntemplate<class T>\nclass FutureState<T&> : public FutureStateBase {\n  // ...\n};\ntemplate<>\nclass FutureState<void> : public FutureStateBase {\n  // ...\n};\n
\n\n\n

Both Future and Promise have a reference to the shared state:

\n
template<class T>\nclass Future {\n  // [...]\nprivate:\n  std::shared_ptr<FutureState<T>> state_;\n};\n\ntemplate<class T>\nclass Promise {\n  // [...]\nprivate:\n  std::shared_ptr<FutureState<T>> state_;\n  bool future_get_ = false;\n};\n
\n\n\n

The crux of future.then() is:

\n
template<class T>\ntemplate<class F>\nauto simgrid::kernel::Future<T>::thenNoUnwrap(F continuation)\n-> Future<decltype(continuation(std::move(*this)))>\n{\n  typedef decltype(continuation(std::move(*this))) R;\n\n  if (state_ == nullptr)\n    throw std::future_error(std::future_errc::no_state);\n\n  auto state = std::move(state_);\n  // Create a new future...\n  Promise<R> promise;\n  Future<R> future = promise.get_future();\n  // ...and when the current future is ready...\n  state->set_continuation(simgrid::xbt::makeTask(\n    [](Promise<R> promise, std::shared_ptr<FutureState<T>> state,\n         F continuation) {\n      // ...set the new future value by running the continuation.\n      Future<T> future(std::move(state));\n      simgrid::xbt::fulfillPromise(promise,[&]{\n        return continuation(std::move(future));\n      });\n    },\n    std::move(promise), state, std::move(continuation)));\n  return std::move(future);\n}\n
\n\n\n

We added a (much simpler) future.then_() method which does not\ncreate a new future:

\n
template<class T>\ntemplate<class F>\nvoid simgrid::kernel::Future<T>::then_(F continuation)\n{\n  if (state_ == nullptr)\n    throw std::future_error(std::future_errc::no_state);\n  // Give shared-ownership to the continuation:\n  auto state = std::move(state_);\n  state->set_continuation(simgrid::xbt::makeTask(\n    std::move(continuation), state));\n}\n
\n\n\n

The .get() method delegates to the shared state. As we mentioned previously, an\nerror is raised if the future is not ready:

\n
template<class T>\nT simgrid::kernel::Future<T>::get()\n{\n  if (state_ == nullptr)\n    throw std::future_error(std::future_errc::no_state);\n  std::shared_ptr<FutureState<T>> state = std::move(state_);\n  return state->get();\n}\n\ntemplate<class T>\nT simgrid::kernel::FutureState<T>::get()\n{\n  if (status_ != FutureStatus::ready)\n    xbt_die(\"Deadlock: this future is not ready\");\n  status_ = FutureStatus::done;\n  if (exception_) {\n    std::exception_ptr exception = std::move(exception_);\n    exception_ = nullptr;\n    std::rethrow_exception(std::move(exception));\n  }\n  xbt_assert(this->value_);\n  auto result = std::move(this->value_.get());\n  this->value_ = boost::optional<T>();\n  return std::move(result);\n}\n
\n\n\n

Generic simcalls

\n

Motivation

\n

Simcalls are not easy to understand, and adding a new one is not easy\neither. In order to add one simcall, one has to first\nadd it to the list of simcalls\nwhich looks like this:

\n
# This looks like C++ but it is a basic IDL-like language\n# (one definition per line) parsed by a python script:\n\nvoid process_kill(smx_process_t process);\nvoid process_killall(int reset_pid);\nvoid process_cleanup(smx_process_t process) [[nohandler]];\nvoid process_suspend(smx_process_t process) [[block]];\nvoid process_resume(smx_process_t process);\nvoid process_set_host(smx_process_t process, sg_host_t dest);\nint  process_is_suspended(smx_process_t process) [[nohandler]];\nint  process_join(smx_process_t process, double timeout) [[block]];\nint  process_sleep(double duration) [[block]];\n\nsmx_mutex_t mutex_init();\nvoid        mutex_lock(smx_mutex_t mutex) [[block]];\nint         mutex_trylock(smx_mutex_t mutex);\nvoid        mutex_unlock(smx_mutex_t mutex);\n\n[...]\n
\n\n\n

At runtime, a simcall is represented by a structure containing a simcall\nnumber and its arguments (among some other things):

\n
struct s_smx_simcall {\n  // Simcall number:\n  e_smx_simcall_t call;\n  // Issuing actor:\n  smx_process_t issuer;\n  // Arguments of the simcall:\n  union u_smx_scalar args[11];\n  // Result of the simcall:\n  union u_smx_scalar result;\n  // Some additional stuff:\n  smx_timer_t timer;\n  int mc_value;\n};\n
\n\n\n

with a scalar union type:

\n
union u_smx_scalar {\n  char            c;\n  short           s;\n  int             i;\n  long            l;\n  long long       ll;\n  unsigned char   uc;\n  unsigned short  us;\n  unsigned int    ui;\n  unsigned long   ul;\n  unsigned long long ull;\n  double          d;\n  void*           dp;\n  FPtr            fp;\n};\n
\n\n\n

Then one has to call (manually\"\ud83d\ude22\") a\nPython script\nwhich generates a bunch of C++ files:

\n\n

Then one has to write the code of the kernel side handler for the simcall\nand the code of the simcall itself (which calls the code-generated\nmarshaling/unmarshaling stuff)\"\ud83d\ude2d\".

\n

In order to simplify this process, we added two generic simcalls which\ncan be used to execute a function in the simulation kernel context:

\n
# This one should really be called run_immediate:\nvoid run_kernel(std::function<void()> const* code) [[nohandler]];\nvoid run_blocking(std::function<void()> const* code) [[block,nohandler]];\n
\n\n\n

Immediate simcall

\n

The first one (simcall_run_kernel()) executes a function in the simulation\nkernel context and returns immediately (without blocking the actor):

\n
void simcall_run_kernel(std::function<void()> const& code)\n{\n  simcall_BODY_run_kernel(&code);\n}\n\ntemplate<class F> inline\nvoid simcall_run_kernel(F& f)\n{\n  simcall_run_kernel(std::function<void()>(std::ref(f)));\n}\n
\n\n\n

On top of this, we add a wrapper which can be used to return a value of any\ntype and properly handles exceptions:

\n
template<class F>\ntypename std::result_of<F()>::type kernelImmediate(F&& code)\n{\n  // If we are in the simulation kernel, we take the fast path and\n  // execute the code directly without simcall\n  // marshalling/unmarshalling/dispatch:\n  if (SIMIX_is_maestro())\n    return std::forward<F>(code)();\n\n  // If we are in the application, pass the code to the simulation\n  // kernel which executes it for us and reports the result:\n  typedef typename std::result_of<F()>::type R;\n  simgrid::xbt::Result<R> result;\n  simcall_run_kernel([&]{\n    xbt_assert(SIMIX_is_maestro(), \"Not in maestro\");\n    simgrid::xbt::fulfillPromise(result, std::forward<F>(code));\n  });\n  return result.get();\n}\n
\n\n\n

where Result<R> can store either a R or an exception.

\n

Example of usage:

\n
xbt_dict_t Host::properties() {\n  return simgrid::simix::kernelImmediate([&] {\n    simgrid::surf::HostImpl* surf_host =\n      this->extension<simgrid::surf::HostImpl>();\n    return surf_host->getProperties();\n  });\n}\n
\n\n\n

In this example, the kernelImmediate() call is not in user code but\nin the framework code. We do not expect the normal user to write\nsimulator kernel code. Those mechanisms are intended to be used by\nthe implementer of the framework in order to implement user\nprimitives.

\n

Blocking simcall

\n

The second generic simcall (simcall_run_blocking()) also executes a function in\nthe SimGrid simulation kernel immediately, but the calling actor is not woken up\nuntil it is explicitly unblocked:

\n
void simcall_run_blocking(std::function<void()> const& code);\n\ntemplate<class F>\nvoid simcall_run_blocking(F& f)\n{\n  simcall_run_blocking(std::function<void()>(std::ref(f)));\n}\n
\n\n\n

The f function is expected to setup some callbacks in the simulation\nkernel which will wake up the actor (with\nsimgrid::simix::unblock(actor)) when the operation is completed.

\n

This is wrapped in a higher-level primitive as well. The\nkernelSync() function expects a function-object which is executed\nimmediately in the simulation kernel and returns a Future<T>. The\nsimulator blocks the actor and resumes it when the Future<T> becomes\nready with its result:

\n
template<class F>\nauto kernelSync(F code) -> decltype(code().get())\n{\n  typedef decltype(code().get()) T;\n  if (SIMIX_is_maestro())\n    xbt_die(\"Can't execute blocking call in kernel mode\");\n\n  smx_process_t self = SIMIX_process_self();\n  simgrid::xbt::Result<T> result;\n\n  simcall_run_blocking([&result, self, &code]{\n    try {\n      auto future = code();\n      future.then_([&result, self](simgrid::kernel::Future<T> value) {\n        // Propagate the result from the future\n        // to the simgrid::xbt::Result:\n        simgrid::xbt::setPromise(result, value);\n        simgrid::simix::unblock(self);\n      });\n    }\n    catch (...) {\n      // The code failed immediately. We can wake up the actor\n      // immediately with the exception:\n      result.set_exception(std::current_exception());\n      simgrid::simix::unblock(self);\n    }\n  });\n\n  // Get the result of the operation (which might be an exception):\n  return result.get();\n}\n
\n\n\n

A contrived example of this would be:

\n
int res = simgrid::simix::kernelSync([&] {\n  return kernel_wait_until(30).then(\n    [](simgrid::kernel::Future<void> future) {\n      return 42;\n    }\n  );\n});\n
\n\n\n

A more realistic example (implementing user-level primitives) would\nbe:

\n
sg_size_t File::read(sg_size_t size)\n{\n  return simgrid::simix::kernelSync([&] {\n    return file_->async_read(size);\n  });\n}\n
\n\n\n

Asynchronous operations

\n

We can write the related kernelAsync() which wakes up the actor immediately\nand returns a future to the actor. As this future is used in the actor context,\nit is a different future\n(simgrid::simix::Future instead of simgrid::kernel::Future)\nwhich implements a C++11 std::future wait-based API:

\n
template <class T>\nclass Future {\npublic:\n  Future() {}\n  Future(simgrid::kernel::Future<T> future) : future_(std::move(future)) {}\n  bool valid() const { return future_.valid(); }\n  T get();\n  bool is_ready() const;\n  void wait();\nprivate:\n  // We wrap an event-based kernel future:\n  simgrid::kernel::Future<T> future_;\n};\n
\n\n\n

The future.get() method is implemented as4:

\n
template<class T>\nT simgrid::simix::Future<T>::get()\n{\n  if (!valid())\n    throw std::future_error(std::future_errc::no_state);\n  smx_process_t self = SIMIX_process_self();\n  simgrid::xbt::Result<T> result;\n  simcall_run_blocking([this, &result, self]{\n    try {\n      // When the kernel future is ready...\n      this->future_.then_(\n        [this, &result, self](simgrid::kernel::Future<T> value) {\n          // ... wake up the process with the result of the kernel future.\n          simgrid::xbt::setPromise(result, value);\n          simgrid::simix::unblock(self);\n      });\n    }\n    catch (...) {\n      result.set_exception(std::current_exception());\n      simgrid::simix::unblock(self);\n    }\n  });\n  return result.get();\n}\n
\n\n\n

kernelAsync() simply \"\ud83d\ude09\" calls kernelImmediate() and wraps the\nsimgrid::kernel::Future into a simgrid::simix::Future:

\n
template<class F>\nauto kernelAsync(F code)\n  -> Future<decltype(code().get())>\n{\n  typedef decltype(code().get()) T;\n\n  // Execute the code in the simulation kernel and get the kernel future:\n  simgrid::kernel::Future<T> future =\n    simgrid::simix::kernelImmediate(std::move(code));\n\n  // Wrap the kernel future in a user future:\n  return simgrid::simix::Future<T>(std::move(future));\n}\n
\n\n\n

A contrived example of this would be:

\n
simgrid::simix::Future<int> future = simgrid::simix::kernelAsync([&] {\n  return kernel_wait_until(30).then(\n    [](simgrid::kernel::Future<void> future) {\n      return 42;\n    }\n  );\n});\ndo_some_stuff();\nint res = future.get();\n
\n\n\n

A more realistic example (implementing user-level primitives) would\nbe:

\n
simgrid::simix::Future<sg_size_t> File::async_read(sg_size_t size)\n{\n  return simgrid::simix::kernelAsync([&] {\n    return file_->async_read(size);\n  });\n}\n
\n\n\n

kernelSync() could be rewritten as:

\n
template<class F>\nauto kernelSync(F code) -> decltype(code().get())\n{\n  return kernelAsync(std::move(code)).get();\n}\n
\n\n\n

The semantics are equivalent but this form would require two simcalls\ninstead of one to do the same job (one in kernelAsync() and one in\n.get()).

\n

Representing the simulated time

\n

SimGrid uses double for representing the simulated time:

\n\n

In contrast, all the C++ APIs use std::chrono::duration and\nstd::chrono::time_point. They are used in:

\n\n

We can define future.wait_for(duration) and future.wait_until(timepoint)\nfor our futures but for better compatibility with standard C++ code, we might\nwant to define versions expecting std::chrono::duration and\nstd::chrono::time_point.

\n

For time points, we need to define a clock working in the simulated time\n(which meets the\nTrivialClock\nrequirements, see\n[time.clock.req]\nin the C++14 standard):

\n
struct SimulationClock {\n  using rep        = double;\n  using period     = std::ratio<1>;\n  using duration   = std::chrono::duration<rep, period>;\n  using time_point = std::chrono::time_point<SimulationClock, duration>;\n  static constexpr bool is_steady = true;\n  static time_point now()\n  {\n    return time_point(duration(SIMIX_get_clock()));\n  }\n};\n
\n\n\n

A time point in the simulation is a time point using this clock:

\n
template<class Duration>\nusing SimulationTimePoint =\n  std::chrono::time_point<SimulationClock, Duration>;\n
\n\n\n

This is used for example in simgrid::s4u::this_actor::sleep_for() and\nsimgrid::s4u::this_actor::sleep_until():

\n
void sleep_for(double duration)\n{\n  if (duration > 0)\n    simcall_process_sleep(duration);\n}\n\nvoid sleep_until(double timeout)\n{\n  double now = SIMIX_get_clock();\n  if (timeout > now)\n    simcall_process_sleep(timeout - now);\n}\n\ntemplate<class Rep, class Period>\nvoid sleep_for(std::chrono::duration<Rep, Period> duration)\n{\n  auto seconds =\n    std::chrono::duration_cast<SimulationClockDuration>(duration);\n  this_actor::sleep_for(seconds.count());\n}\n\ntemplate<class Duration>\nvoid sleep_until(const SimulationTimePoint<Duration>& timeout_time)\n{\n  auto timeout_native =\n    std::chrono::time_point_cast<SimulationClockDuration>(timeout_time);\n  this_actor::sleep_until(timeout_native.time_since_epoch().count());\n}\n
\n\n\n

Which means it is possible to use (since C++14):

\n
using namespace std::chrono_literals;\nsimgrid::s4u::this_actor::sleep_for(42s);\n
\n\n\n

Mutexes and condition variables

\n

Mutexes

\n

SimGrid has had a C-based API for mutexes and condition variables for\nsome time. These mutexes are different from the standard\nsystem-level mutex (std::mutex, pthread_mutex_t, etc.) because\nthey work at simulation-level. Locking on a simulation mutex does\nnot block the thread directly but makes a simcall\n(simcall_mutex_lock()) which asks the simulation kernel to wake the calling\nactor when it can get ownership of the mutex. Blocking directly at the\nOS level would deadlock the simulation.

\n

Reusing the C++ standard API for our simulation mutexes has many\nbenefits:

\n\n

We defined a reference-counted Mutex class for this (which meets\nthe Lockable\nrequirements, see\n[thread.req.lockable.req]\nin the C++14 standard):

\n
class Mutex {\n  friend ConditionVariable;\nprivate:\n  friend simgrid::simix::Mutex;\n  simgrid::simix::Mutex* mutex_;\n  Mutex(simgrid::simix::Mutex* mutex) : mutex_(mutex) {}\npublic:\n\n  friend void intrusive_ptr_add_ref(Mutex* mutex);\n  friend void intrusive_ptr_release(Mutex* mutex);\n  using Ptr = boost::intrusive_ptr<Mutex>;\n\n  // No copy:\n  Mutex(Mutex const&) = delete;\n  Mutex& operator=(Mutex const&) = delete;\n\n  static Ptr createMutex();\n\npublic:\n  void lock();\n  void unlock();\n  bool try_lock();\n};\n
\n\n\n

The methods are simply wrappers around existing simcalls:

\n
void Mutex::lock()\n{\n  simcall_mutex_lock(mutex_);\n}\n
\n\n\n

Using the same API as std::mutex (Lockable) means we can use existing\nC++-standard code such as std::unique_lock<Mutex> or\nstd::lock_guard<Mutex> for exception-safe mutex handling8:

\n
{\n  std::lock_guard<simgrid::s4u::Mutex> lock(*mutex);\n  sum += 1;\n}\n
\n\n\n

Condition Variables

\n

Similarly, SimGrid already had simulation-level condition variables\nwhich can be exposed using the same API as std::condition_variable:

\n
class ConditionVariable {\nprivate:\n  friend s_smx_cond;\n  smx_cond_t cond_;\n  ConditionVariable(smx_cond_t cond) : cond_(cond) {}\npublic:\n\n  ConditionVariable(ConditionVariable const&) = delete;\n  ConditionVariable& operator=(ConditionVariable const&) = delete;\n\n  friend void intrusive_ptr_add_ref(ConditionVariable* cond);\n  friend void intrusive_ptr_release(ConditionVariable* cond);\n  using Ptr = boost::intrusive_ptr<ConditionVariable>;\n  static Ptr createConditionVariable();\n\n  void wait(std::unique_lock<Mutex>& lock);\n  template<class P>\n  void wait(std::unique_lock<Mutex>& lock, P pred);\n\n  // Wait functions taking a plain double as time:\n\n  std::cv_status wait_until(std::unique_lock<Mutex>& lock,\n    double timeout_time);\n  std::cv_status wait_for(\n    std::unique_lock<Mutex>& lock, double duration);\n  template<class P>\n  bool wait_until(std::unique_lock<Mutex>& lock,\n    double timeout_time, P pred);\n  template<class P>\n  bool wait_for(std::unique_lock<Mutex>& lock,\n    double duration, P pred);\n\n  // Wait functions taking a std::chrono time:\n\n  template<class Rep, class Period, class P>\n  bool wait_for(std::unique_lock<Mutex>& lock,\n    std::chrono::duration<Rep, Period> duration, P pred);\n  template<class Rep, class Period>\n  std::cv_status wait_for(std::unique_lock<Mutex>& lock,\n    std::chrono::duration<Rep, Period> duration);\n  template<class Duration>\n  std::cv_status wait_until(std::unique_lock<Mutex>& lock,\n    const SimulationTimePoint<Duration>& timeout_time);\n  template<class Duration, class P>\n  bool wait_until(std::unique_lock<Mutex>& lock,\n    const SimulationTimePoint<Duration>& timeout_time, P pred);\n\n  // Notify:\n\n  void notify_one();\n  void notify_all();\n\n};\n
\n\n\n

We currently accept both double (for simplicity and consistency with\nthe current codebase) and std::chrono types (for compatibility with\nC++ code) as durations and timepoints. One important thing to notice here is\nthat cond.wait_for() and cond.wait_until() work in the simulated time,\nnot in the real time.

\n

The simple cond.wait() and cond.wait_for() delegate to\npre-existing simcalls:

\n
void ConditionVariable::wait(std::unique_lock<Mutex>& lock)\n{\n  simcall_cond_wait(cond_, lock.mutex()->mutex_);\n}\n\nstd::cv_status ConditionVariable::wait_for(\n  std::unique_lock<Mutex>& lock, double timeout)\n{\n  // The simcall uses -1 for \"any timeout\" but we don't want this:\n  if (timeout < 0)\n    timeout = 0.0;\n\n  try {\n    simcall_cond_wait_timeout(cond_, lock.mutex()->mutex_, timeout);\n    return std::cv_status::no_timeout;\n  }\n  catch (xbt_ex& e) {\n\n    // If the exception was a timeout, we have to take the lock again:\n    if (e.category == timeout_error) {\n      try {\n        lock.mutex()->lock();\n        return std::cv_status::timeout;\n      }\n      catch (...) {\n        std::terminate();\n      }\n    }\n\n    std::terminate();\n  }\n  catch (...) {\n    std::terminate();\n  }\n}\n
\n\n\n

Other methods are simple wrappers around those two:

\n
template<class P>\nvoid ConditionVariable::wait(std::unique_lock<Mutex>& lock, P pred)\n{\n  while (!pred())\n    wait(lock);\n}\n\ntemplate<class P>\nbool ConditionVariable::wait_until(std::unique_lock<Mutex>& lock,\n  double timeout_time, P pred)\n{\n  while (!pred())\n    if (this->wait_until(lock, timeout_time) == std::cv_status::timeout)\n      return pred();\n  return true;\n}\n\ntemplate<class P>\nbool ConditionVariable::wait_for(std::unique_lock<Mutex>& lock,\n  double duration, P pred)\n{\n  return this->wait_until(lock,\n    SIMIX_get_clock() + duration, std::move(pred));\n}\n
\n\n\n

Conclusion

\n

We wrote two future implementations based on the std::future API:

\n\n

These futures are used to implement kernelSync() and kernelAsync() which\nexpose asynchronous operations in the simulation kernel to the actors.

\n

In addition, we wrote variations of some other C++ standard library\nclasses (SimulationClock, Mutex, ConditionVariable) which work in\nthe simulation:

\n\n

Reusing the same API as the C++ standard library is very useful because:

\n\n

This type of approach might be useful for other libraries which define\ntheir own contexts. An example of this is\nMordor, a I/O library using fibers\n(cooperative scheduling): it implements cooperative/fiber\nmutex,\nrecursive\nmutex\nwhich are compatible with the\nBasicLockable\nrequirements (see\n[thread.req.lockable.basic]\nin the C++14 standard).

\n

Appendix: useful helpers

\n

Result

\n

Result is like a mix of std::future and std::promise in a\nsingle object, without shared state or synchronisation:

\n
template<class T>\nclass Result {\n  enum class ResultStatus {\n    invalid,\n    value,\n    exception,\n  };\npublic:\n  Result();\n  ~Result();\n  Result(Result const& that);\n  Result& operator=(Result const& that);\n  Result(Result&& that);\n  Result& operator=(Result&& that);\n  bool is_valid() const;\n  void reset();\n  void set_exception(std::exception_ptr e);\n  void set_value(T&& value);\n  void set_value(T const& value);\n  T get();\nprivate:\n  ResultStatus status_ = ResultStatus::invalid;\n  union {\n    T value_;\n    std::exception_ptr exception_;\n  };\n};\n
\n\n\n

Promise helpers

\n

These helpers are useful for dealing with generic future-based code:

\n
template<class R, class F>\nauto fulfillPromise(R& promise, F&& code)\n-> decltype(promise.set_value(code()))\n{\n  try {\n    promise.set_value(std::forward<F>(code)());\n  }\n  catch(...) {\n    promise.set_exception(std::current_exception());\n  }\n}\n\ntemplate<class P, class F>\nauto fulfillPromise(P& promise, F&& code)\n-> decltype(promise.set_value())\n{\n  try {\n    std::forward<F>(code)();\n    promise.set_value();\n  }\n  catch(...) {\n    promise.set_exception(std::current_exception());\n  }\n}\n\ntemplate<class P, class F>\nvoid setPromise(P& promise, F&& future)\n{\n  fulfillPromise(promise, [&]{ return std::forward<F>(future).get(); });\n}\n
\n\n\n

Task

\n

Task<R(F...)> is a type-erased callable object similar to\nstd::function<R(F...)> but works with move-only types. It is similar to\nstd::packaged_task<R(F...)> but does not wrap the result in a std::future<R>\n(it is not packaged).

\n
                 std::function    std::packaged_task   simgrid::xbt::Task
Copyable         Yes              No                   No
Movable          Yes              Yes                  Yes
Call             const            non-const            non-const
Callable         multiple times   once                 once
Sets a promise   No               Yes                  No
\n

It could be implemented as:

\n
template<class T>\nclass Task {\nprivate:\n  std::packaged_task<T> task_;\npublic:\n\n  template<class F>\n  Task(F f) :\n    task_(std::forward<F>(f))\n  {}\n\n  template<class... ArgTypes>\n  auto operator()(ArgTypes... args)\n  -> decltype(task_.get_future().get())\n  {\n    task_(std::forward<ArgTypes>(args)...);\n    return task_.get_future().get();\n  }\n\n};\n
\n\n\n

but we don't need a shared-state.

\n

This is useful in order to bind move-only type arguments:

\n
template<class F, class... Args>\nclass TaskImpl {\nprivate:\n  F code_;\n  std::tuple<Args...> args_;\n  typedef decltype(simgrid::xbt::apply(\n    std::move(code_), std::move(args_))) result_type;\npublic:\n  TaskImpl(F code, std::tuple<Args...> args) :\n    code_(std::move(code)),\n    args_(std::move(args))\n  {}\n  result_type operator()()\n  {\n    // simgrid::xbt::apply is C++17 std::apply:\n    return simgrid::xbt::apply(std::move(code_), std::move(args_));\n  }\n};\n\ntemplate<class F, class... Args>\nauto makeTask(F code, Args... args)\n-> Task< decltype(code(std::move(args)...))() >\n{\n  TaskImpl<F, Args...> task(\n    std::move(code), std::make_tuple(std::move(args)...));\n  return std::move(task);\n}\n
\n\n\n

Update (2018-08-15): there is a\nproposal\nfor including this as std::unique_function in the C++ standard.\nIn addition to the implementations listed in the paper, there is also\nfolly::Function\nor stdlab::task.\nThere is a later proposal\nfor extending std::function\nto non-copyable move-only types and one-shot calls\nwith eg. std::function<void()&&>.

\n
\n
\n
    \n
  1. \n

    The relationship between the SimGrid simulation kernel and the simulated\nactors is similar to the relationship between an OS kernel and the OS\nprocesses: the simulation kernel manages (schedules) the execution of the\nactors; the actors make requests to the simulation kernel using simcalls.\nHowever, both the simulation kernel and the actors currently run in the same\nOS process (and use same address space).\u00a0\u21a9

    \n
  2. \n
  3. \n

    This is the kind of futures that are available in ECMAScript which use\nthe same kind of never-blocking asynchronous model as our discrete event\nsimulator.\u00a0\u21a9

    \n
  4. \n
  5. \n

Currently, we have not implemented some features such as shared\nfutures.\u00a0\u21a9

    \n
  6. \n
  7. \n

    You might want to compare this method with simgrid::kernel::Future::get()\nwe showed previously: the method of the kernel future does not block and\nraises an error if the future is not ready; the method of the actor future\nblocks after having set a continuation to wake the actor when the future\nis ready.\u00a0\u21a9

    \n
  8. \n
  9. \n

    (which are related to the fact that we are in a non-blocking single-threaded\nsimulation engine)\u00a0\u21a9

    \n
  10. \n
  11. \n

In the C++ standard library, std::future<T> is used by the consumer\nof the result. On the other hand, std::promise<T> is used by the\nproducer of the result. The producer calls promise.set_value(42)\nor promise.set_exception(e) in order to set the result which will\nbe made available to the consumer by future.get().\u00a0\u21a9

    \n
  12. \n
  13. \n

    Calling the continuations from simulation loop means that we don't have\nto fear problems like invariants not being restored when the callbacks\nare called \"\ud83d\ude28\" or stack overflows triggered by deeply nested\ncontinuations chains \"\ud83d\ude30\". The continuations are all called in a\nnice and predictable place in the simulator with a nice and predictable\nstate \"\ud83d\ude0c\".\u00a0\u21a9

    \n
  14. \n
  15. \n

std::lock() might kinda work too but it may not be such a good idea to\nuse it as it may use a deadlock avoidance algorithm such as\ntry-and-back-off.\nA backoff would probably uselessly wait in real time instead of simulated\ntime. The deadlock avoidance algorithm might as well add non-determinism\nin the simulation which we would like to avoid.\nstd::try_lock() should be safe to use though.\u00a0\u21a9

    \n
  16. \n
  17. \n

    There's an interesting library implementation in\nRust as well.\u00a0\u21a9

    \n
  18. \n
\n
"}, {"id": "http://www.gabriel.urdhr.fr/2016/03/25/cloc-with-flamegraph/", "title": "Number of lines of code with FlameGraph", "url": "https://www.gabriel.urdhr.fr/2016/03/25/cloc-with-flamegraph/", "date_published": "2016-03-25T00:00:00+01:00", "date_modified": "2016-03-25T00:00:00+01:00", "tags": ["computer", "simgrid"], "content_html": "

FlameGraph\nis used to display stack trace samples but we can use it for\nother purposes as well.

\n

For example, we can quite simply display where are the lines of code\nof a project:

\n
cloc --csv-delimiter=\"$(printf '\\t')\" --by-file --quiet --csv src/ include/ |\nsed '1,2d' |\ncut -f 2,5 |\nsed 's/\\//;/g' |\n./flamegraph.pl\n
\n\n\n
\n\n \"\"\n\n
Number of lines of code in SimGrid
\n
\n\n

References

\n"}, {"id": "http://www.gabriel.urdhr.fr/2015/11/25/rr-use-after-free/", "title": "Debugging use-after-free with RR reverse execution", "url": "https://www.gabriel.urdhr.fr/2015/11/25/rr-use-after-free/", "date_published": "2015-11-25T00:00:00+01:00", "date_modified": "2015-11-25T00:00:00+01:00", "tags": ["computer", "debug", "gdb", "rr", "simgrid"], "content_html": "

RR is a very useful tool for debugging. It\ncan record the execution of a program and then replay the exact same\nexecution at will inside a debugger. One very useful extra power\navailable since 4.0 is the support for efficient reverse\nexecution\nwhich can be used to find the root cause of a bug in your program\nby rewinding time. In this example, we reverse-execute a program from a\ncase of use-after-free in order to find where the block of memory was\nfreed.

\n

TLDR

\n
\n$ rr record ./foo my_args\n$ rr replay\n(rr) continue\n(rr) break free if $rdi == some_address\n(rr) reverse-continue\n
\n\n

Problem

\n

We have a case of use-after-free:

\n
$ gdb --args java -classpath \"$classpath\" surfCpuModel/TestCpuModel \\\n  small_platform.xml surfCpuModelDeployment.xml \\\n  --cfg=host/model:compound\n\n(gdb) run\n[\u2026]\n\nProgram received signal SIGSEGV, Segmentation fault.\n[Switching to Thread 0x7ffff7fbb700 (LWP 12766)]\n0x00007fffe4fe3fb7 in xbt_dynar_map (dynar=0x7ffff0276ea0, op=0x56295a443b6c65) at /home/gabriel/simgrid/src/xbt/dynar.c:603\n603     op(elm);\n\n(gdb) p *dynar\n$2 = {size = 2949444837771837443, used = 3415824664728436765,\n      elmsize = 3414970357536090483, data = 0x646f4d2f66727573,\n      free_f = 0x56295a443b6c65}\n
\n\n

The fields of this structure are all wrong and we suspect that this block of heap memory was already freed and reused by another allocation.

\n

We could use GDB with a conditional breakpoint on free(ptr) with ptr == dynar, but this approach poses a few problems:

\n
    \n
  1. \n

    in a new execution of the program this address might be completely different because of different sources of indeterminism, such as:

    \n
  2. \n
  3. \n

    ASLR, which we could disable with setarch -R,

    \n
  4. \n
  5. \n

    scheduling of the different threads (and Java usually spawns quite\n a few threads);

    \n
  6. \n
  7. \n

    there could be a lot of calls to free() for this specific address from previous allocations before we reach the correct one.

    \n
  8. \n
\n

Using RR

\n

Deterministic recording

\n

RR can be used to create a recording of a given execution of the\nprogram. This execution can then be replayed exactly inside a\ndebugger. This fixes our first problem.

\n

Let's record our crash in RR:

\n
$ rr record java -classpath \"$classpath\" surfCpuModel/TestCpuModel \\\n  small_platform.xml surfCpuModelDeployment.xml \\\n    --cfg=host/model:compound\n[\u2026]\n# A fatal error has been detected by the Java Runtime Environment:\n[\u2026]\n
\n\n

Now we can replay the exact same execution over and over again in a special GDB session:

\n
$ rr replay\n(rr) continue\nContinuing.\n[\u2026]\n\nProgram received signal SIGSEGV, Segmentation fault.\n[Switching to Thread 12601.12602]\n0x00007fe94761efb7 in xbt_dynar_map (dynar=0x7fe96c24f350, op=0x56295a443b6c65) at /home/gabriel/simgrid/src/xbt/dynar.c:603\n603     op(elm);\n
\n\n

Reverse execution to the root cause of the problem

\n

We want to know who freed this block of memory. RR 4.0 provides\nsupport for efficient reverse-execution which can be used to solve our\nsecond problem.

\n

Let's set a conditional breakpoint on free():

\n
(rr) p dynar\n$1 = (const xbt_dynar_t) 0x7fe96c24f350\n\n(rr) break free if $rdi == 0x7fe96c24f350\n
\n\n

Note: This is for x86_64.\nIn the x86_64 ABI,\nthe RDI register is used to pass the first parameter.

\n

Now we can use RR super powers by reverse-executing the program until\nwe find who freed this block of memory:

\n
\n(rr) reverse-continue\nContinuing.\nProgram received signal SIGSEGV, Segmentation fault.\n[\u2026]\n\n(rr) reverse-continue\nContinuing.\nBreakpoint 1, __GI___libc_free (mem=0x7fe96c24f350) at malloc.c:2917\n2917    malloc.c: Aucun fichier ou dossier de ce type.\n\n(bt) backtrace\n#0  __GI___libc_free (mem=0x7fe96c24f350) at malloc.c:2917\n#1  0x00007fe96b18486d in ZIP_FreeEntry (jz=0x7fe96c0f43d0, ze=0x7fe96c24f6e0) at ../../../src/share/native/java/util/zip/zip_util.c:1104\n#2  0x00007fe968191d78 in ?? ()\n#3  0x00007fe96818dcbb in ?? ()\n#4  0x0000000000000002 in ?? ()\n#5  0x00007fe96c24f6e0 in ?? ()\n#6  0x000000077ab0c2d8 in ?? ()\n#7  0x00007fe970641a80 in ?? ()\n#8  0x0000000000000000 in ?? ()\n\n(rr) reverse-continue\nContinuing.\nBreakpoint 1, __GI___libc_free (mem=0x7fe96c24f350) at malloc.c:2917\n2917    in malloc.c\n\n(rr) backtrace\n#0  __GI___libc_free (mem=0x7fe96c24f350) at malloc.c:2917\n#1  0x00007fe94761f28e in xbt_dynar_to_array (dynar=0x7fe96c24f350) at /home/gabriel/simgrid/src/xbt/dynar.c:691\n#2  0x00007fe946b98a2f in SwigDirector_CpuModel::createCpu (this=0x7fe96c14d850, name=0x7fe96c156862 \"Tremblay\", power_peak=0x7fe96c24f350, pstate=0, \n    power_scale=1, power_trace=0x0, core=1, state_initial=SURF_RESOURCE_ON, state_trace=0x0, cpu_properties=0x0)\n    at /home/gabriel/simgrid/src/bindings/java/org/simgrid/surf/surfJAVA_wrap.cxx:1571\n#3  0x00007fe947531615 in cpu_parse_init (host=0x7fe9706456d0) at /home/gabriel/simgrid/src/surf/cpu_interface.cpp:44\n#4  0x00007fe947593f88 in sg_platf_new_host (h=0x7fe9706456d0) at /home/gabriel/simgrid/src/surf/sg_platf.c:138\n#5  0x00007fe9475e54fb in ETag_surfxml_host () at /home/gabriel/simgrid/src/surf/surfxml_parse.c:481\n#6  0x00007fe9475da1dc in surf_parse_lex () at src/surf/simgrid_dtd.c:7093\n#7  0x00007fe9475e84f2 in _surf_parse () at /home/gabriel/simgrid/src/surf/surfxml_parse.c:1068\n#8  0x00007fe9475e8cfa in parse_platform_file (file=0x7fe96c14f1e0 
\"/home/gabriel/simgrid/examples/java/../platforms/small_platform.xml\")\n    at /home/gabriel/simgrid/src/surf/surfxml_parseplatf.c:172\n#9  0x00007fe9475142f4 in SIMIX_create_environment (file=0x7fe96c14f1e0 \"/home/gabriel/simgrid/examples/java/../platforms/small_platform.xml\")\n    at /home/gabriel/simgrid/src/simix/smx_environment.c:39\n#10 0x00007fe9474cd98f in MSG_create_environment (file=0x7fe96c14f1e0 \"/home/gabriel/simgrid/examples/java/../platforms/small_platform.xml\")\n    at /home/gabriel/simgrid/src/msg/msg_environment.c:37\n#11 0x00007fe94686c473 in Java_org_simgrid_msg_Msg_createEnvironment (env=0x7fe96c00a1d8, cls=0x7fe9706459a8, jplatformFile=0x7fe9706459b8)\n    at /home/gabriel/simgrid/src/bindings/java/jmsg.c:203\n#12 0x00007fe968191d78 in ?? ()\n#13 0x00000007fffffffe in ?? ()\n#14 0x00007fe970645958 in ?? ()\n#15 0x00000007f5cd1100 in ?? ()\n#16 0x00007fe9706459b8 in ?? ()\n#17 0x00000007f5cd1738 in ?? ()\n#18 0x0000000000000000 in ?? ()\n
\n\n

Now that we have found the offending free() call we can inspect the state\nof the program:

\n
\n(rr) frame 1\n#1  0x00007fe94761f28e in xbt_dynar_to_array (dynar=0x7fe96c24f350) at /home/gabriel/simgrid/src/xbt/dynar.c:691\n691   free(dynar);\n\n(rr) list\n686 {\n687   void *res;\n688   xbt_dynar_shrink(dynar, 1);\n689   memset(xbt_dynar_push_ptr(dynar), 0, dynar->elmsize);\n690   res = dynar->data;\n691   free(dynar);\n692   return res;\n693 }\n694\n695 /** @brief Compare two dynars\n
\n\n

If necessary we could continue reverse-executing in order to understand\nbetter what caused the problem.

\n

Using GDB

\n

While GDB has builtin support for reverse\nexecution,\ndoing the same thing in GDB is much slower. Moreover, recording the execution fills the GDB record buffer quite rapidly, which prevents us from recording a large execution: with the native support of GDB we would probably need to narrow down the region where the bug appeared in order to only record (and then reverse-execute) a small part of the execution of the program.

\n

References

\n"}, {"id": "http://www.gabriel.urdhr.fr/2015/09/01/simgrid-mc-rewrite/", "title": "SimGridMC: The Big Split (and Cleanup)", "url": "https://www.gabriel.urdhr.fr/2015/09/01/simgrid-mc-rewrite/", "date_published": "2015-09-01T00:00:00+02:00", "date_modified": "2015-09-01T00:00:00+02:00", "tags": ["computer", "simgrid", "system"], "content_html": "

In my previous SimGrid post, I\ntalked about different solutions for a better isolation between the model-checked application and the model-checker. We chose to avoid the (hacky) solution based on multiple dynamic-linker namespaces in the same process and use a more conventional process-based isolation.

\n

Table of Contents

\n
\n\n
\n

Motivation

\n

In the previous version of SimGridMC, the model-checker was running in the same process as the main SimGrid application. We had in the same process:

\n\n

Multiple heaps

\n

To do this, the SimGridMC process was using two different malloc() heaps in the same process in order to separate:

\n
    \n
  1. \n

    the state of the simulated application (processes states and global\n state);

    \n
  2. \n
  3. \n

    the state of the model-checker.

    \n
  4. \n
\n

The model-checker contained a lot of code to select which heap had to be active (and used by malloc() and friends) at a given point of the code.

\n

This is an example of a function with a lot of heap management calls\n(the lines managing the heap swapping are commented with <*>):

\n
void MC_pre_modelcheck_safety()\n{\n\n  int mc_mem_set = (mmalloc_get_current_heap() == mc_heap);  // <*>\n\n  mc_state_t initial_state = NULL;\n  smx_process_t process;\n\n  /* Create the initial state and push it into the exploration stack */\n  if (!mc_mem_set)                                           // <*>\n    MC_SET_MC_HEAP;                                          // <*>\n\n  if (_sg_mc_visited > 0)\n    visited_states = xbt_dynar_new(sizeof(mc_visited_state_t),\n      visited_state_free_voidp);\n\n  initial_state = MC_state_new();\n\n  MC_SET_STD_HEAP;                                           // <*>\n\n  /* Wait for requests (schedules processes) */\n  MC_wait_for_requests();\n\n  MC_SET_MC_HEAP;                                            // <*>\n\n  /* Get an enabled process and insert it in the interleave set\n     of the initial state */\n  xbt_swag_foreach(process, simix_global->process_list) {\n    if (MC_process_is_enabled(process)) {\n      MC_state_interleave_process(initial_state, process);\n      if (mc_reduce_kind != e_mc_reduce_none)\n        break;\n    }\n  }\n\n  xbt_fifo_unshift(mc_stack, initial_state);\n\n  if (!mc_mem_set)                                           // <*>\n    MC_SET_STD_HEAP;                                         // <*>\n}\n
\n\n\n

The heap management code was cumbersome and difficult to maintain: it was necessary to know which functions had to be called in each context, which functions selected the correct heap themselves, and to select the current heap accordingly. It was moreover necessary to know which data was allocated in which heap. Failing to use the correct heap could lead to errors such as:

\n\n

Goals and solutions

\n

While this design was interesting for the performance of the\nmodel-checker, it was quite difficult to maintain and understand. We\nwanted to create a new version of the model-checker which would be\nsimpler to understand and maintain:

\n\n

In order to avoid the coexistence of the two heaps we envisioned two\npossible solutions:

\n\n

While the dynamic-linker based solution is quite interesting and would\nprovide better performance by avoiding context switches (and who\ndoesn't want to write their own dynamic linker?), it would probably be\ndifficult to achieve and would probably not make the code easier to\nunderstand.

\n

We chose to use the much more standard solution of using different\nprocesses which is conceptually much simpler and provides a better\nisolation between the model-checker and the model-checked application.\nWith this design, the model-checker is a quite standard process: all\ndebugging tools can be used without any problem (Valgrind, GDB) on the\nmodel-checker process. The model-checked process is not completely\nstandard as we are constantly overwriting its state but we can still\nptrace it and use a debugger.

\n

Update (2016-04-01): the model-checker now ptraces the\nmodel-checked application (for various reasons) and it is not possible\nto debug the model-checked application anymore. However, we have a\nfeature to replay an execution of the model-checked application\noutside of the model-checker.

\n

Splitting the model-checker and the simulator

\n

In this new design, the model-checker process behaves somehow like a\ndebugger for the simulated (model-checked) application by monitoring\nand controlling its execution. The model-checker process is\nresponsible for:

\n\n

The simulated application is responsible for:

\n\n

Two mechanisms are used to implement the interaction between the\nmodel-checker process and the model-checked application:

\n\n

Since Linux 3.2, it is possible to read from and write to another process's virtual\nmemory\nwithout ptrace()-ing it: I took care not to use ptrace() in order to keep it available for other purposes (a process can only be ptraced by a single process at a time):

\n\n

The split has been done in two phases:

\n
    \n
  1. \n

    In the first phase, the split process mode was implemented but the single-process mode was still present and enabled by default. This allowed us to detect regressions against the single-process mode and to compare both modes of operation. The resulting code was quite ugly because it had to handle both modes of operation.

    \n
  2. \n
  3. \n

    When the split process mode was complete and working correctly, the\n single-process mode was removed and a lot of cleanup could be done.

    \n
  4. \n
\n

Explicit communications

\n

The model-checker process and the model-checked application communicate with each other over a UNIX datagram socket. This socket is created by the model-checker and passed to the child model-checked process.

\n

This is used in the initialisation:

\n\n

This is used at runtime to control the execution of the model-checked application:

\n\n

The (simplified) client-loop looks like this:

\n
void MC_client_main_loop(void)\n{\n  while (1) {\n    message_type message;\n    receive_message(&message);\n    switch(message.type()) {\n\n    // Executes a simcall:\n    case MC_MESSAGE_SIMCALL_HANDLE:\n      execute_transition(message.transition());\n      send_message(MC_MESSAGE_WAITING);\n      break;\n\n    // Execute application code until a visible simcall is reached:\n    case MC_MESSAGE_CONTINUE:\n      execute_application_code();\n      send_message(MC_MESSAGE_WAITING);\n      break;\n\n    // [...] (Other messages here)\n    }    \n  }\n}\n
\n\n\n

Each model-checking algorithm (safety, liveness, communication\ndeterminism) is implemented as model-checker side code which triggers\nexecution of model-checked-side transitions with:

\n
// Execute a simcall (MC_MESSAGE_SIMCALL_HANDLE):\nMC_simcall_handle(req, value);\n\n// Execute simulated application code (MC_MESSAGE_CONTINUE):\nMC_wait_for_requests();\n
\n\n\n

The communication determinism algorithm needs to see the result of\nsome simcalls before triggering the application code:

\n
MC_simcall_handle(req, value);\nMC_handle_comm_pattern(call, req, value, communication_pattern, 0);\nMC_wait_for_requests();\n
\n\n\n

Snapshot/restore

\n

Snapshot and restoration are handled by reading/writing the model-checked process memory with /proc/$pid/mem. During this operation, the model-checked process is waiting for messages on a special stack dedicated to the simulator (which is not managed by the snapshotting logic). During this time, the model-checked application is not supposed to be accessing the simulated application memory. When this is finished, the model-checker wakes up the simulated application with the MC_MESSAGE_SIMCALL_HANDLE and MC_MESSAGE_CONTINUE messages.

\n

Peeking at the state of the model-checked application

\n

The model-checker needs to read some of the state of the simulator (state of the communications, names of the processes and so on). Currently this is handled quite brutally by reading the data directly from the structures of the model-checked process (following linked-list items, array elements, etc. in the remote process):

\n
// Read the hostname from the MCed process:\nprocess->read_bytes(&host_copy, sizeof(host_copy), remote(p->host));\nint len = host_copy.key_len + 1;\nchar hostname[len];\nprocess->read_bytes(hostname, len, remote(host_copy.key));\ninfo->hostname = mc_model_checker->get_host_name(hostname);\n
\n\n\n

This is quite ugly and should probably be replaced by some more\nstructured way to share this information in the future.

\n

Impact on the user interface

\n

We now have a simgrid-mc executable for the model-checker process.\nIt must be called explicitly by the user in order to use the\nmodel-checker (similarly to gdb or other debugging tools):

\n
# Running the raw application:\n./bugged1\n\n# Running the application in GDB:\ngdb --args ./bugged1\n\n# Running the application in valgrind:\nvalgrind ./bugged1\n\n# Running the application in SimgridMC:\nsimgrid-mc ./bugged1\n
\n\n\n

For SMPI applications, the -wrapper argument of smpirun must be used:

\n
# Running the raw application:\nsmpirun \\\n  -hostfile hostfile -platform platform.xml \\\n  --cfg=maxmin/precision:1e-9 --cfg=network/model:SMPI \\\n  --cfg=network/TCP_gamma:4194304 \\\n  -np 4 --cfg=smpi/send_is_detached_thres:0 --cfg=smpi/coll_selector:mpich \\\n  --cfg=contexts/factory:ucontext --cfg=contexts/stack_size:4 \\\n  ./dup\n\n# Running the application in GDB:\nsmpirun -wrapper \"gdb --args\" \\\n  -hostfile hostfile -platform platform.xml \\\n  --cfg=maxmin/precision:1e-9 --cfg=network/model:SMPI \\\n  --cfg=network/TCP_gamma:4194304 \\\n  -np 4 --cfg=smpi/send_is_detached_thres:0 --cfg=smpi/coll_selector:mpich \\\n  --cfg=contexts/factory:ucontext --cfg=contexts/stack_size:4 \\\n  ./dup\n\n# Running the application in valgrind:\nsmpirun -wrapper \"valgrind\" \\\n  -hostfile hostfile -platform platform.xml \\\n  --cfg=maxmin/precision:1e-9 --cfg=network/model:SMPI \\\n  --cfg=network/TCP_gamma:4194304 \\\n  -np 4 --cfg=smpi/send_is_detached_thres:0 --cfg=smpi/coll_selector:mpich \\\n  --cfg=contexts/factory:ucontext --cfg=contexts/stack_size:4 \\\n  ./dup\n\n# Running the application in SimgridMC:\nsmpirun -wrapper \"simgrid-mc\" \\\n  -hostfile hostfile -platform platform.xml \\\n  --cfg=maxmin/precision:1e-9 --cfg=network/model:SMPI \\\n  --cfg=network/TCP_gamma:4194304 \\\n  -np 4 --cfg=smpi/send_is_detached_thres:0 --cfg=smpi/coll_selector:mpich \\\n  --cfg=contexts/factory:ucontext --cfg=contexts/stack_size:4 \\\n  ./dup\n
\n\n\n

Under the hood, simgrid-mc sets a few environment variables for its child process:

\n\n

Cleanup

\n

After implementing the split-process mode, the single-process mode was removed in order to have cleaner code. To have the two modes of operation coexist, many functions had to check the mode of operation and change their behaviour accordingly. Most of this code has been removed and the result is now much simpler.

\n

The code managing the two heaps is now useless and has been completely removed. We are still using our custom heap implementation in the model-checked application, however: we use its internal representation to track the different allocations in the heap; it is used as well to clear the bytes of an allocation before giving it to the application. The model-checker process, on the other hand, is a quite standard application and uses the standard system heap implementation (or could use another implementation), which is expected to have better performance than ours.

\n

Currently, it is not quite clear which parts of the API are intended to be used by the model-checked process, which parts are to be used by the model-checker process and which parts can be used by both. Some effort has been made to separate the different parts of the API (by moving them into different header files) but this is still an ongoing process. In the future, we might want a better organisation using different header files, namespaces and possibly different shared objects for the different parts of the API.

\n

A longer-term goal would be to have a nice API for the model-checker which could easily be used by users to write their own model-checking algorithms (in their own executables). We might even want to export a Lua-based binding for writing model-checking algorithms.

\n

Conversion to C++

\n

In parallel, the model-checker code has been ported to C++ and part of it has been rewritten in more idiomatic C++:

\n\n

All the MC code has been converted to C++ but the conversion to\nidiomatic C++ is still ongoing: some parts of the code are still using\nC idioms.

\n

Performance

\n

This first version is significantly slower than the previous one. This was expected: the new implementation uses cross-process communications, and the old version had been heavily optimised. However, this might be improved in the future in order to minimise the overhead of cross-process synchronisations.

\n

Conclusion

\n

This is a first step towards a cleaner and simpler SimGridMC. The heap-juggling code has been removed. However, we now have some code which reads directly from the data structures of the other process: this code is not so nice and not so maintainable, and we will probably want to find a better way to do this.

\n

Some things still need to be done:

\n"}, {"id": "http://www.gabriel.urdhr.fr/2015/01/06/simgrid-mc-isolation/", "title": "Better isolation for SimGridMC", "url": "https://www.gabriel.urdhr.fr/2015/01/06/simgrid-mc-isolation/", "date_published": "2015-01-06T00:00:00+01:00", "date_modified": "2015-01-06T00:00:00+01:00", "tags": ["simgrid", "system", "computer", "linker", "linux", "simulation", "elf"], "content_html": "

In an attempt to simplify the development around the SimGrid model-checker, we were thinking about moving the model-checker out into a different process. Another approach would be to use dynamic-linker isolation of the different components of the process. Here's a summary of the goals, problems and design issues surrounding these topics.

\n

Table of Contents

\n
\n\n
\n

Current state

\n

SimGrid architecture

\n

The design of the SimGrid simulator is based on the design of an operating system.

\n

In a typical OS, we have a kernel managing a global state and several userspace processes running on top of the kernel. The kernel schedules the execution of the different processes (and their threads) on the available CPUs. The kernel provides the processes with an API made of several system calls.

\n
\n \n \n Process\n \n Process\n \n Process\n \n Process\n\n \n System calls\n \n OS kernel\n \n
\n\n

SimGrid simulates a distributed system: it simulates a network and lets the different processes of the simulated system use this simulated network. Each simulated process runs on top of the SimGrid kernel. The SimGrid kernel schedules the execution of the different processes on the available OS threads. The SimGrid kernel provides the processes with an API made of several simulation calls.

\n
\n \n \n Process\n \n Process\n \n Process\n \n Process\n \n Simulation calls\n\n \n SimGrid kernel\n \n
\n\n

In order to reduce the cost of context switching between the different processes, in the current implementation of SimGrid all the simulated processes and the SimGrid kernel live in the same OS process: there is no MMU-enforced separation of memory between the simulated processes, but they are expected to communicate with each other using only the means provided by the SimGrid kernel (the simulation calls) and should not share mutable memory.

\n
\n \n \n \n Process\n \n Process\n \n Process\n \n Process\n\n \n Simulation calls\n \n SimGrid kernel\n \n \n \n \n System calls\n \n OS kernel\n \n \n
\n\n

The SimGrid kernel has a dedicated stack and each simulated process has its\nown stack: cooperative multitasking (fibers, ucontext) is used to\nswitch between the different contexts (SimGrid kernel/process) and is\nused by the SimGrid kernel to schedule the execution of the different\nprocesses.

\n

The same (libc) heap is shared between the SimGrid kernel and the\nsimulated processes.

\n

SimGridMC architecture

\n

The SimGrid model-checker is a dynamic analysis component for SimGrid. It explores the different possible interleavings of the execution of the simulated processes (depending on the execution of their transitions, i.e. the different possible orderings of their communications).

\n

In order to do this, the MC saves the state of the system at each node of the graph of possible executions:

\n\n

Those states are then used to:

\n\n

In the current implementation, the model-checker lives in the same\nprocess as the main SimGrid process (the SimGrid kernel and the\nprocesses):

\n
\n \n \n \n\n \n Process\n \n Process\n \n Process\n \n Process\n \n Simulation calls\n \n SimGrid kernel\n \n Model-checker\n\n \n \n
\n\n

Multiple heaps

\n

However, the model-checker needs to maintain its own\nstate: the state of the model-checker must not be saved, compared and\nrestored with the rest of the state.

\n

In order to do this, the state of the model-checker is maintained in a\nsecond heap:

\n\n

This is implemented by overriding malloc(), free() and friends in order to support multiple heaps. A global variable is used to choose the current working heap:

\n
// Simplified code\nxbt_mheap_t __mmalloc_current_heap = NULL;\n\nvoid *malloc(size_t n)\n{\n  return mmalloc(__mmalloc_current_heap, n);\n}\n\nvoid free(void *ptr)\n{\n  return mfree(__mmalloc_current_heap, ptr);\n}\n
\n\n\n

Limitation of the approach

\n

The current implementation is complicated and not easy to understand and\nmaintain:

\n\n

A first motivation for modifying the architecture of SimGridMC is to increase the maintainability of the SimGridMC codebase.

\n

Another related goal is to simplify the debugging experience (of the simulated\napplication, the SimGrid kernel and the model-checker). For example, the current\nversion of SimGridMC does not work under valgrind. A solution which would\nprovide a more powerful debugging experience would be a valuable tool for the\nSimGridMC devs but more importantly for the users of SimGridMC.

\n

Process-based isolation

\n

For all these reasons, we would like to move the model-checker into a separate process: a model-checker process maintains the model-checker state and controls the execution of a model-checked process.

\n
\n \n \n \n Process\n \n Process\n \n Process\n \n Process\n\n \n Simulation calls\n \n SimGrid kernel\n \n \n \n \n Model-checking interface\n \n Model-checker\n \n \n
\n\n

Memory snapshot/restoration

\n

The snapshotting/restoration of the model-checked process memory can be done using /proc/${pid}/mem or process_vm_readv() and process_vm_writev().

\n

As long as the OS threads are living on stacks which are not managed\nby the state snapshot/restoration mechanism, they will not be\naffected: we must take care that the OS threads switch to unmanaged\nstacks when we are doing the state snapshots/restorations.

\n

Another solution would be to use ptrace() with PTRACE_GETREGSET\nand PTRACE_SETREGSET in order to snapshot/restore the registers of\neach thread but we would like to avoid this in order to be able to use\nptrace() for debugging or other\npurposes.

\n

File descriptors restoration

\n

Linux does not provide a way to change the file descriptors of another process: the restoration of the file descriptors must be done in the target OS process and cannot be done from the model-checker process. Cooperation of the model-checked process is needed for file descriptor restoration.

\n

We could abuse ptrace()-based syscall rewriting techniques or some\nsort of parasite injection in order to\nachieve this.

\n

Dynamic-linker based isolation

\n

Another idea would be to create a custom dynamic linker with namespace\nsupport in order to be able to link multiple instances of the same\nlibrary and provide isolation between different parts of the process.

\n

This could be used to:

\n\n

Prior art in DCE with dlmopen()

\n

It turns out that\nDCE\nalready uses a similar approach to load multiple application instances along with Linux kernel instances (and their network stacks) on top of the NS3 packet-level network simulator, all in the same process: the applications and the Linux kernel are compiled as shared objects, the latter forming a Library OS (the liblinux.so shared object), and loaded multiple times in the same process alongside the NS3 instance.

\n

Among several alternative\nstrategies,\nDCE uses the dlmopen()\nfunction. This is a variant of\ndlopen() originating from\nSunOS/Solaris and implemented in the GNU userland which allows loading dynamic libraries in separate namespaces:

\n\n

An alternative implementation of the ld.so dynamic linker,\nelf-loader, is used which\nprovides additional\nfeatures:

\n\n

More information about dlmopen()\ncan be found in old versions of the Sun\nLinkers and Libraries Guide.

\n

A custom dynamic loader/linker on top of libc

\n

However, I was envisioning something slightly different: instead of writing a replacement for ld.so (using raw system calls), I was thinking about building the custom dynamic linker on top of libc and libdl, in order to be able to use libc (malloc()), libdl and libelf instead of raw system calls.

\n

Impact on debuggability

\n

In a split-process design, the model-checker could be a quite standard application avoiding weird hacks (such as introspection with /proc/self/maps and DWARF, snapshotting/restoration of the state with memcpy(), a custom mmalloc() implementation with multiple heaps). Once a relevant trajectory of the model-checked application has been identified, it could be replayed outside of the model-checker and debugged in this simpler mode.

\n

However, having a single process could lead to a better debugging experience: being able to combine breakpoints in the model-checker, the SimGrid kernel and the simulated application, with conditions spanning all those components.

\n

At the same time, using multiple dynamic-linking namespaces could make the debugging experience more complicated. I'm not sure how well it is supported by the different available debugging tools. The DCE tools seem to show that it is reasonably well supported by\nGDB\nand\nvalgrind.

\n

Conclusion

\n

So we have two possible directions:

\n\n

The first solution provides better isolation of the model-checker. The second solution is closer to the current implementation and should have better performance by avoiding context switches and IPC in favour of direct memory access and function calls. Moreover, the dynamic-linker-based isolation could be reused for other parts of the project (such as the isolation of the simulated MPI processes).

\n

It is not clear which solution would provide the better debugging experience for\nthe user and which solution would be better for the maintainability of\nSimGridMC.

\n

Appendix: dlmopen() quick demo

\n

This simple program creates three new namespaces and loads libpthread in those\nnamespaces:

\n
#define _GNU_SOURCE\n#include <dlfcn.h>\n\n#include <unistd.h>\n\nint main(int argc, const char** argv)\n{\n  size_t i;\n  for (i=0; i!=3; ++i) {\n    void* x = dlmopen(LM_ID_NEWLM, \"libpthread.so.0\", RTLD_NOW);\n    if (!x)\n      return 1;\n  }\n  while(1) sleep(200000);\n  return 0;\n}\n
\n\n\n

We see that libpthread is loaded thrice. Each instance has its own libc\ninstance as well (and a fourth one is loaded for the main program):

\n
00400000-00401000 r-xp 00000000 08:06 7603474                            /home/myself/temp/a.out\n00600000-00601000 rw-p 00000000 08:06 7603474                            /home/myself/temp/a.out\n0173a000-0175b000 rw-p 00000000 00:00 0                                  [heap]\n7fca7ac7d000-7fca7ae1c000 r-xp 00000000 08:01 2626010                    /lib/x86_64-linux-gnu/libc-2.19.so\n7fca7ae1c000-7fca7b01c000 ---p 0019f000 08:01 2626010                    /lib/x86_64-linux-gnu/libc-2.19.so\n7fca7b01c000-7fca7b020000 r--p 0019f000 08:01 2626010                    /lib/x86_64-linux-gnu/libc-2.19.so\n7fca7b020000-7fca7b022000 rw-p 001a3000 08:01 2626010                    /lib/x86_64-linux-gnu/libc-2.19.so\n7fca7b022000-7fca7b026000 rw-p 00000000 00:00 0\n7fca7b026000-7fca7b03e000 r-xp 00000000 08:01 2625992                    /lib/x86_64-linux-gnu/libpthread-2.19.so\n7fca7b03e000-7fca7b23d000 ---p 00018000 08:01 2625992                    /lib/x86_64-linux-gnu/libpthread-2.19.so\n7fca7b23d000-7fca7b23e000 r--p 00017000 08:01 2625992                    /lib/x86_64-linux-gnu/libpthread-2.19.so\n7fca7b23e000-7fca7b23f000 rw-p 00018000 08:01 2625992                    /lib/x86_64-linux-gnu/libpthread-2.19.so\n7fca7b23f000-7fca7b243000 rw-p 00000000 00:00 0\n7fca7b243000-7fca7b3e2000 r-xp 00000000 08:01 2626010                    /lib/x86_64-linux-gnu/libc-2.19.so\n7fca7b3e2000-7fca7b5e2000 ---p 0019f000 08:01 2626010                    /lib/x86_64-linux-gnu/libc-2.19.so\n7fca7b5e2000-7fca7b5e6000 r--p 0019f000 08:01 2626010                    /lib/x86_64-linux-gnu/libc-2.19.so\n7fca7b5e6000-7fca7b5e8000 rw-p 001a3000 08:01 2626010                    /lib/x86_64-linux-gnu/libc-2.19.so\n7fca7b5e8000-7fca7b5ec000 rw-p 00000000 00:00 0\n7fca7b5ec000-7fca7b604000 r-xp 00000000 08:01 2625992                    /lib/x86_64-linux-gnu/libpthread-2.19.so\n7fca7b604000-7fca7b803000 ---p 00018000 08:01 2625992                    
/lib/x86_64-linux-gnu/libpthread-2.19.so\n7fca7b803000-7fca7b804000 r--p 00017000 08:01 2625992                    /lib/x86_64-linux-gnu/libpthread-2.19.so\n7fca7b804000-7fca7b805000 rw-p 00018000 08:01 2625992                    /lib/x86_64-linux-gnu/libpthread-2.19.so\n7fca7b805000-7fca7b809000 rw-p 00000000 00:00 0\n7fca7b809000-7fca7b9a8000 r-xp 00000000 08:01 2626010                    /lib/x86_64-linux-gnu/libc-2.19.so\n7fca7b9a8000-7fca7bba8000 ---p 0019f000 08:01 2626010                    /lib/x86_64-linux-gnu/libc-2.19.so\n7fca7bba8000-7fca7bbac000 r--p 0019f000 08:01 2626010                    /lib/x86_64-linux-gnu/libc-2.19.so\n7fca7bbac000-7fca7bbae000 rw-p 001a3000 08:01 2626010                    /lib/x86_64-linux-gnu/libc-2.19.so\n7fca7bbae000-7fca7bbb2000 rw-p 00000000 00:00 0\n7fca7bbb2000-7fca7bbca000 r-xp 00000000 08:01 2625992                    /lib/x86_64-linux-gnu/libpthread-2.19.so\n7fca7bbca000-7fca7bdc9000 ---p 00018000 08:01 2625992                    /lib/x86_64-linux-gnu/libpthread-2.19.so\n7fca7bdc9000-7fca7bdca000 r--p 00017000 08:01 2625992                    /lib/x86_64-linux-gnu/libpthread-2.19.so\n7fca7bdca000-7fca7bdcb000 rw-p 00018000 08:01 2625992                    /lib/x86_64-linux-gnu/libpthread-2.19.so\n7fca7bdcb000-7fca7bdcf000 rw-p 00000000 00:00 0\n7fca7bdcf000-7fca7bf6e000 r-xp 00000000 08:01 2626010                    /lib/x86_64-linux-gnu/libc-2.19.so\n7fca7bf6e000-7fca7c16e000 ---p 0019f000 08:01 2626010                    /lib/x86_64-linux-gnu/libc-2.19.so\n7fca7c16e000-7fca7c172000 r--p 0019f000 08:01 2626010                    /lib/x86_64-linux-gnu/libc-2.19.so\n7fca7c172000-7fca7c174000 rw-p 001a3000 08:01 2626010                    /lib/x86_64-linux-gnu/libc-2.19.so\n7fca7c174000-7fca7c178000 rw-p 00000000 00:00 0\n7fca7c178000-7fca7c17b000 r-xp 00000000 08:01 2626017                    /lib/x86_64-linux-gnu/libdl-2.19.so\n7fca7c17b000-7fca7c37a000 ---p 00003000 08:01 2626017                    
/lib/x86_64-linux-gnu/libdl-2.19.so\n7fca7c37a000-7fca7c37b000 r--p 00002000 08:01 2626017                    /lib/x86_64-linux-gnu/libdl-2.19.so\n7fca7c37b000-7fca7c37c000 rw-p 00003000 08:01 2626017                    /lib/x86_64-linux-gnu/libdl-2.19.so\n7fca7c37c000-7fca7c39c000 r-xp 00000000 08:01 2625993                    /lib/x86_64-linux-gnu/ld-2.19.so\n7fca7c568000-7fca7c56b000 rw-p 00000000 00:00 0\n7fca7c59a000-7fca7c59c000 rw-p 00000000 00:00 0\n7fca7c59c000-7fca7c59d000 r--p 00020000 08:01 2625993                    /lib/x86_64-linux-gnu/ld-2.19.so\n7fca7c59d000-7fca7c59e000 rw-p 00021000 08:01 2625993                    /lib/x86_64-linux-gnu/ld-2.19.so\n7fca7c59e000-7fca7c59f000 rw-p 00000000 00:00 0\n7fffa8481000-7fffa84a2000 rw-p 00000000 00:00 0                          [stack]\n7fffa85f5000-7fffa85f7000 r-xp 00000000 00:00 0                          [vdso]\n7fffa85f7000-7fffa85f9000 r--p 00000000 00:00 0                          [vvar]\nffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]\n
\n\n\n

The new namespaces are probably not fully functional in this state:\nconflicts between the different instances still need to be resolved. For example,\neach libc instance presumably tries to manage the same heap with sbrk().

"}, {"id": "http://www.gabriel.urdhr.fr/2014/11/03/not-cleaning-the-stack/", "title": "Avoiding to clean the stack", "url": "https://www.gabriel.urdhr.fr/2014/11/03/not-cleaning-the-stack/", "date_published": "2014-11-03T00:00:00+01:00", "date_modified": "2014-11-03T00:00:00+01:00", "tags": ["computer", "simgrid", "compilation", "assembly", "x86_64"], "content_html": "

In two previous posts, I looked into cleaning the stack frame of a\nfunction before using it by adding assembly at the beginning of each\nfunction. This was done either by modifying LLVM with a custom\ncodegen pass or by\nrewriting the\nassembly\nbetween the compiler and the assembler. The current implementation\nadds a loop at the beginning of every function. We look at the impact\nof this modification on the performance of the application.

\n

Update: this is an updated version of the post with fixed\ncode and updated results (the original version of the code was\nbroken).

\n

Initial results

\n

Here are the initial results:

\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
Test | Normal | Stack cleaning
ctest (complete testsuite) | 348.06s | 387.53s
ctest -R mc-bugged1-liveness-visited-ucontext-sparse | 1.53s | 2.00s
run_test comm dup 4 | 42.54s | 127.80s
\n

On large problems, the overhead of the stack-cleaning modification\nbecomes significant.

\n

Optimisation

\n

We would like to avoid the overhead of the stack-cleaning code. In order\nto do this we can use the following facts:

\n\n

Thus, we can disable stack-cleaning if we detect that we are not\nexecuting the application code. This can be implemented in two ways:

\n\n

In order to evaluate the efficiency of this approach, we use a simple\ncomparison of %rsp against a constant value:

\n
    movq $0x7fff00000000, %r11\n    cmpq %r11, %rsp\n    jae .Lstack_cleaner_done0\n    movabsq $3, %r11\n.Lstack_cleaner_loop0:\n    movq    $0, -32(%rsp,%r11,8)\n    subq    $1, %r11\n    jne     .Lstack_cleaner_loop0\n.Lstack_cleaner_done0:\n    # Main code of the function goes here\n
\n\n\n

The value is hardcoded in this prototype but it could be loaded from a\nglobal variable instead.

\n

Here are the results with this optimisation:

\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
Test | Normal | Stack cleaning
ctest (complete testsuite) | 348.06s | 372.95s
ctest -R mc-bugged1-liveness-visited-ucontext-sparse | 1.53s | 1.53s
run_test comm dup 4 | 42.54s | 36.68s
\n

Appendix: reproducibility

\n

Those results were generated with:

\n
MAKEFLAGS=\"-j$(nproc)\"\n\ngit clone https://gforge.inria.fr/git/simgrid/simgrid.git\ncd simgrid\ngit checkout cd84ed2b393b564f5d8bfdaae60b814f81f24dc4\nsimgrid=\"$(pwd)\"\n\nmkdir build-normal\ncd build-normal\ncmake .. -Denable_model-checking=ON -Denable_documentation=OFF \\\n  -Denable_compile_warnings=ON -Denable_smpi_MPICH3_testsuite=ON\nmake $MAKEFLAGS\ncd ..\n\nmkdir build-zero\ncd build-zero\ncmake .. -Denable_model-checking=ON -Denable_documentation=OFF \\\n  -Denable_compile_warnings=ON -Denable_smpi_MPICH3_testsuite=ON \\\n  -DCMAKE_C_COMPILER=\"$simgrid/tools/stack-cleaner/cc\" \\\n  -DCMAKE_CXX_COMPILER=\"$simgrid/tools/stack-cleaner/c++\" \\\n  -DGFORTRAN_EXE=\"$simgrid/tools/stack-cleaner/fortran\"\nmake $MAKEFLAGS\ncd ..\n\nrun_test() {\n  (\n  platform=$(find $simgrid -name small_platform_with_routers.xml)\n  hostfile=$(find $simgrid | grep mpich3-test/hostfile$)\n\n  local base\n  base=$(pwd)\n  cd $base/teshsuite/smpi/mpich3-test/$1/\n\n  $base/bin/smpirun -hostfile $hostfile -platform $platform \\\n    --cfg=maxmin/precision:1e-9 --cfg=network/model:SMPI \\\n    --cfg=network/TCP_gamma:4194304 \\\n    -np $3 --cfg=model-check:1 \\\n    --cfg=smpi/send_is_detached_thres:0 --cfg=smpi/coll_selector:mpich \\\n    --cfg=contexts/factory:ucontext --cfg=model-check/max_depth:100000 \\\n    --cfg=model-check/reduction:none --cfg=model-check/visited:100000 \\\n    --cfg=contexts/stack_size:4 --cfg=model-check/sparse-checkpoint:yes \\\n    --cfg=model-check/soft-dirty:no ./$2 > /dev/null\n  )\n}\n
\n\n\n

The results without the optimisation are obtained by removing the\nrelevant assembly from the clean-stack-filter script.

"}, {"id": "http://www.gabriel.urdhr.fr/2014/10/06/cleaning-the-stack-in-a-llvm-pass/", "title": "Cleaning the stack in a LLVM pass", "url": "https://www.gabriel.urdhr.fr/2014/10/06/cleaning-the-stack-in-a-llvm-pass/", "date_published": "2014-10-06T00:00:00+02:00", "date_modified": "2014-10-06T00:00:00+02:00", "tags": ["computer", "simgrid", "llvm", "compilation", "assembly", "x86_64"], "content_html": "

In the previous episode, we implemented an LLVM pass which does\nnothing. Now we are trying to modify\nthis to create a (proof-of-concept) LLVM pass which fills the current\nstack frame with zero before using it.

\n

Table of Content

\n
\n\n
\n

Structure of the x86-64 stack

\n

Basic structure

\n

The top (in fact the bottom) of the stack is stored in the %rsp\nregister: a push operation decrements the value of %rsp and stores\nthe value at the resulting address; conversely a pop operation\nincrements the value of %rsp. Stack variables are allocated by\ndecrementing %rsp.

\n

A function call (call) pushes the current value of the instruction\n(%rip) pointer on the stack. A return instruction (ret) pops a\nvalue from the stack into %rip.

\n

A typical call frame contains in order:

\n\n
\n
  1. parameter for f()
  2. parameter for f()
  3. return address to caller of f()
  4. local variable for f()
  5. local variable for f()
  6. parameter for g()
  7. parameter for g()
  8. return address to f() (caller of g())
  9. local variable for g()
  10. local variable for g() ← %rsp
x86-64 stack structure for f()\n calls g()
\n
\n\n

For example this C code,

\n
int f();\n\nint main(int argc, char** argv) {\n  int i = 42;\n  f();\n  return 0;\n}\n
\n\n\n

is compiled (with clang -S -fomit-frame-pointer example.c) into this\n(using AT&T\nsyntax):

\n
main:\n    subq    $24, %rsp\n    movl    $0, 20(%rsp)\n    movl    %edi, 16(%rsp)\n    movq    %rsi, 8(%rsp)\n    movl    $42, 4(%rsp)\n    movb    $0, %al\n    callq   f\n    movl    $0, %edi\n    movl    %eax, (%rsp)\n    movl    %edi, %eax\n    addq    $24, %rsp\n    ret\n
\n\n\n

Memory is allocated on the stack using subq. Local variables are\nusually referenced by offsets from the stack pointer, OFFSET(%rsp).

\n

Frame pointer

\n

The x86 (32-bit) ABI uses %rbp as the base of the stack frame. This is\nnot mandatory in the x86-64\nABI but the\ncompiler might still use a frame pointer. The base of the stack frame\nis then stored in %rbp.

\n
\n
  1. parameter for f()
  2. parameter for f()
  3. return address to caller of f()
  4. saved %rbp from caller of f() ← saved %rbp
  5. local variable for f()
  6. local variable for f()
  7. parameter for g()
  8. parameter for g()
  9. return address to f() (caller of g())
  10. saved %rbp from f() ← %rbp
  11. local variable for g()
  12. local variable for g() ← %rsp
\n
x86-64 stack structure for f()\n calls g() with frame pointer
\n
\n\n

Here is the same program compiled with -fno-omit-frame-pointer:

\n
main:\n    pushq   %rbp\n    movq    %rsp, %rbp\n    subq    $32, %rsp\n    movl    $0, -4(%rbp)\n    movl    %edi, -8(%rbp)\n    movq    %rsi, -16(%rbp)\n    movl    $42, -20(%rbp)\n    movb    $0, %al\n    callq   f\n    movl    $0, %edi\n    movl    %eax, -24(%rbp)\n    movl    %edi, %eax\n    addq    $32, %rsp\n    popq    %rbp\n    ret\n
\n\n\n

When a frame pointer is used, stack memory is usually referenced as a\nfixed offset from %rbp: OFFSET(%rbp).

\n

Red zone

\n

The x86 32-bit ABI did not allow the code of a function to store\ndata beyond the top of the stack: a signal handler could at any\nmoment clobber any memory beyond the top of the stack.

\n

The standard x86-64\nABI allows the\ncode of the current function to use the 128 bytes (the red zone) beyond\nthe top of the stack. The OS must set up signal handler frames beyond\nthe red zone. The red zone can be used for temporary variables\nor for local variables of leaf functions (functions which do not call\nother functions).

\n
\n
  1. parameter for f()
  2. parameter for f()
  3. return address to caller of f()
  4. local variable for f()
  5. local variable for f()
  6. parameter for g()
  7. parameter for g()
  8. return address to f() (caller of g())
  9. local variable for g()
  10. local variable for g() ← %rsp
  11. red zone
  12. …
  13. red zone
\n
x86-64 stack structure for f()\n calls g() (with the red zone)
\n
\n\n

Note: Windows systems do not use the standard x86-64 ABI: the\nregister usage is different and there is no red zone.

\n

Let's make main() a leaf function:

\n
int main(int argc, char** argv) {\n  int i = 42;\n  return 0;\n}\n
\n\n\n

The variables are allocated in the red zone (negative offsets from the\nstack pointer):

\n
main:\n        movl    $0, %eax\n        movl    $0, -4(%rsp)\n        movl    %edi, -8(%rsp)\n        movq    %rsi, -16(%rsp)\n        movl    $42, -20(%rsp)\n        ret\n
\n\n\n

Cleaning the stack

\n

Assembly

\n

Here is the code we are going to add at the beginning of each\nfunction:

\n
    movq $QSIZE, %r11\n.Lloop:\n        movq $0, OFFSET(%rsp,%r11,8)\n        subq $1, %r11\n        jne  .Lloop\n
\n\n\n

for some suitable values of QSIZE and OFFSET.

\n

The %r11 register is defined by the System V x86-64 ABI (as well as the\nWindows ABI) as a scratch register: at the beginning of the\nfunction we are free to use it without saving it first.

\n

LLVM pass

\n

This is implemented by a StackCleaner machine pass whose\nrunOnMachineFunction() works similarly to the NoopInserter pass.

\n

Parameter computation

\n

We compute the parameters of the generated native code from the size of\nthe stack frame:

\n\n
int size = fn.getFrameInfo()->getStackSize();\nint qsize = size / sizeof(uint64_t);\nif (size==0) {\n  // No stack to clean, we do not modify the function:\n  return false;\n}\nint offset = - size - sizeof(uint64_t);\n
\n\n\n

Basic blocks

\n

For LLVM, a function is represented as a collection\nof basic\nblocks. A basic block is a sequence of instructions where:

\n\n

Our assembly snippet is made of two basic blocks:

\n
  1. the first instruction;
  2. the remaining instructions, up to the end of the snippet.
\n
MachineBasicBlock* bb0 = fn.begin();\nMachineBasicBlock* bb1 = fn.CreateMachineBasicBlock();\nMachineBasicBlock* bb2 = fn.CreateMachineBasicBlock();\n\nfn.push_front(bb2);\nfn.push_front(bb1);\n
\n\n\n

A function is a Control Flow Graph of basic blocks. We need to\nadd the arcs of this graph:

\n
bb1->addSuccessor(bb2); // fall through into the loop\nbb2->addSuccessor(bb2); // loop back edge\nbb2->addSuccessor(bb0); // exit to the original entry block\n
\n\n\n

Machine instruction generation

\n

We generate the machine instructions:

\n
// First basic block (initialisation):\n\n// movq $QSIZE, %r11\nllvm::BuildMI(*bb1, bb1->end(), llvm::DebugLoc(), TII.get(llvm::X86::MOV64ri),\n  X86::R11).addImm(qsize);\n\n// Second basic block (.Lloop):\n\n// movq $0, OFFSET(%rsp,%r11,8)\nllvm::BuildMI(*bb2, bb2->end(), llvm::DebugLoc(), TII.get(llvm::X86::MOV64mi32))\n  .addReg(X86::RSP).addImm(8).addReg(X86::R11).addImm(offset).addReg(0)\n  .addImm(0);\n\n// subq $1, %r11\nllvm::BuildMI(*bb2, bb2->end(), llvm::DebugLoc(), TII.get(llvm::X86::SUB64ri8),\n  X86::R11)\n  .addReg(X86::R11)\n  .addImm(1);\n\n// jne  .Lloop\nllvm::BuildMI(*bb2, bb2->end(), llvm::DebugLoc(), TII.get(llvm::X86::JNE_4))\n  .addMBB(bb2);\n
\n\n\n

The instruction names carry suffixes encoding their argument sizes and types:

\n\n

Modification notification

\n

The function has been modified:

\n
return true;\n
\n\n\n

Result

\n

Generated assembly

\n

Here is the generated assembly for our test code:

\n
main:\n    movabsq $3, %r11\n.LBB0_1:\n    movq    $0, -32(%rsp,%r11,8)\n    subq    $1, %r11\n    jne .LBB0_1\n    subq    $24, %rsp\n    movl    $0, 20(%rsp)\n    movl    %edi, 16(%rsp)\n    movq    %rsi, 8(%rsp)\n    movl    $42, 4(%rsp)\n    movb    $0, %al\n    callq   f\n    movl    $0, %edi\n    movl    %eax, (%rsp)\n    movl    %edi, %eax\n    addq    $24, %rsp\n    retq\n
\n\n\n

Test program

\n

Here is a simple test program reading uninitialised stack variables:

\n
#include <stdio.h>\n\nvoid f() {\n  int i;\n  int data[16];\n\n  for(i=0; i!=16; ++i)\n    printf(\"%i \", data[i]);\n  printf(\"\\n\");\n\n  for(i=0; i!=16; ++i)\n    data[i] = i;\n}\n\nvoid g() {\n  int i, j, k, l, m, n, o, p;\n  printf(\"%i %i %i %i %i %i %i %i\\n\", i, j, k, l, m, n, o, p);\n}\n\nint main(int argc, char** argv) {\n  f();\n  f();\n  g();\n  return 0;\n}\n
\n\n\n

This is the output of a normal compilation:

\n
-1 0 -812203224 32767 -406470232 32655 -400476992 32655 -400465496 32655 0 0 1 0 4195997 0\n0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15\n16 0 0 15774463 15 14 13 12\n
\n\n

And with our stack-cleaning clang:

\n
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n0 0 0 0 0 0 0 0\n
\n\n

Result on SimGrid

\n

The whole SimGrid test suite passes when compiled without SimGridMC\nsupport.

\n

At this point, I discovered that SimGrid fails to run when compiled\nwith clang (or DragonEgg) with support for SimGridMC. I need to fix\nthis first before testing the impact of cleaning the stack on\nSimGridMC state comparison.

\n

In the next episode, I'll try another implementation of the same\nconcept using a few scripts in order to process the generated\nassembly between the compiler and the\nassembler\nwhich should work with a standard GCC and with SimGridMC.

\n

References

\n"}, {"id": "http://www.gabriel.urdhr.fr/2014/10/06/cleaning-the-stack-by-filtering-the-assembly/", "title": "Cleaning the stack by filtering the assembly", "url": "https://www.gabriel.urdhr.fr/2014/10/06/cleaning-the-stack-by-filtering-the-assembly/", "date_published": "2014-10-06T00:00:00+02:00", "date_modified": "2014-10-06T00:00:00+02:00", "tags": ["computer", "simgrid", "unix", "compilation", "assembly", "x86_64"], "content_html": "

In order to help the SimGridMC state comparison code, I wrote a\nproof-of-concept LLVM pass which cleans each stack\nframe before using\nit. However, SimGridMC currently does not work properly when compiled\nwith clang/LLVM. We can do the same thing by pre-processing the\nassembly generated by the compiler before it is assembled:\nthis is done by inserting a script between the compiler and the\nassembler. This script will rewrite the generated assembly by\nprepending stack-cleaning code at the beginning of each function.

\n

Table of Content

\n
\n\n
\n

Summary

\n

In a typical compilation process, the compiler (here cc1) reads the\ninput source file and generates assembly. This assembly is then passed\nto the assembler (as) which generates native binary code:

\n
cat foo.c | cc1  | as      > foo.o\n#         \u2191      \u2191         \u2191\n#         Source Assembly  Native\n
\n\n\n

We can achieve our goal without depending on LLVM by adding a simple\nassembly-rewriting script to this pipeline between the compiler\nand the assembler:

\n
cat foo.c | cc1  | clean-stack-filter | as     > foo.o\n#         \u2191      \u2191                    \u2191        \u2191\n#         Source Assembly             Assembly Native\n
\n\n\n

By doing this, our modification can be used with any compiler, as long\nas it sends assembly to an external assembler instead of generating\nthe native binary code directly.

\n

This will be done in three components:

\n\n

Assembly rewriting script

\n

The first step is to write a simple UNIX filter which reads the\nassembly code of a source file and writes it back with a stack-cleaning\npre-prologue added to each function.

\n

Here is the generated assembly for the test function of the previous\nepisode (compiled with GCC):

\n
main:\n.LFB0:\n    .cfi_startproc\n    subq    $40, %rsp\n    .cfi_def_cfa_offset 48\n    movl    %edi, 12(%rsp)\n    movq    %rsi, (%rsp)\n    movl    $42, 28(%rsp)\n    movl    $0, %eax\n    call    f\n    movl    $0, %eax\n    addq    $40, %rsp\n    .cfi_def_cfa_offset 8\n    ret\n    .cfi_endproc\n
\n\n\n

We can use .cfi_startproc to find the beginning of a function and\neach pushq and subq $x, %rsp instruction to estimate the stack\nsize used by this function (excluding the red zone and alloca(), as\nbefore). Each time we see the beginning of a function, we buffer the\nlines until we are ready to emit the stack-cleaning code.

\n
#!/usr/bin/perl -w\n# Transform assembly in order to clean each stack frame for X86_64.\n\nuse strict;\n$SIG{__WARN__} = sub { die @_ };\n\n# Whether we are still scanning the content of a function:\nour $scanproc = 0;\n\n# Save lines of the function:\nour $lines = \"\";\n\n# Size of the stack for this function:\nour $size = 0;\n\n# Counter for assigning unique ids to labels:\nour $id=0;\n\nsub emit_code {\n    my $qsize = $size / 8;\n    my $offset = - $size - 8;\n\n    if($size != 0) {\n      print(\"\\tmovabsq \\$$qsize, %r11\\n\");\n      print(\".Lstack_cleaner_loop$id:\\n\");\n      print(\"\\tmovq    \\$0, $offset(%rsp,%r11,8)\\n\");\n      print(\"\\tsubq    \\$1, %r11\\n\");\n      print(\"\\tjne     .Lstack_cleaner_loop$id\\n\");\n    }\n\n    print $lines;\n\n    $id = $id + 1;\n    $size = 0;\n    $lines = \"\";\n    $scanproc = 0;\n}\n\nwhile (<>) {\n  if ($scanproc) {\n      $lines = $lines . $_;\n      if (m/^[ \\t]*.cfi_endproc$/) {\n      emit_code();\n      } elsif (m/^[ \\t]*pushq/) {\n      $size += 8;\n      } elsif (m/^[ \\t]*subq[\\t *]\\$([0-9]*),[ \\t]*%rsp$/) {\n          my $val = $1;\n          $val = oct($val) if $val =~ /^0/;\n          $size += $val;\n          emit_code();\n      }\n  } elsif (m/^[ \\t]*.cfi_startproc$/) {\n      print $_;\n\n      $scanproc = 1;\n  } else {\n      print $_;\n  }\n}\n
\n\n\n

This is used as:

\n
# Use either of:\nclean-stack-filter < helloworld.s\ngcc -o- -S hellworld.c | clean-stack-filter | gcc -x assembler -r -o helloworld\n
\n\n\n

And this produces:

\n
main:\n.LFB0:\n    .cfi_startproc\n    movabsq $5, %r11\n.Lstack_cleaner_loop0:\n    movq    $0, -48(%rsp,%r11,8)\n    subq    $1, %r11\n    jne     .Lstack_cleaner_loop0\n    subq    $40, %rsp\n    .cfi_def_cfa_offset 48\n    movl    %edi, 12(%rsp)\n    movq    %rsi, (%rsp)\n    movl    $42, 28(%rsp)\n    movl    $0, %eax\n    call    f\n    movl    $0, %eax\n    addq    $40, %rsp\n    .cfi_def_cfa_offset 8\n    ret\n    .cfi_endproc\n
\n\n\n

Assembler wrapper

\n

The second step is to write an extended version of the as program which\naccepts an extra argument, --filter my_shell_command. We could\nhardcode the filtering script in this wrapper but a generic assembler\nwrapper might be reused somewhere else.

\n

We need to:

\n
  1. interpret a part of the as command line arguments and our extra argument;
  2. apply the specified filter on the input assembly;
  3. pass the resulting assembly to the real assembler.
\n
#!/usr/bin/ruby\n# Wrapper around the real `as` which adds filtering capabilities.\n\nrequire \"tempfile\"\nrequire \"fileutils\"\n\ndef wrapped_as(argv)\n\n  args=[]\n  input=nil\n  as=\"as\"\n  filter=\"cat\"\n\n  i = 0\n  while i<argv.size\n    case argv[i]\n\n    when \"--as\"\n      as = argv[i+1]\n      i = i + 1\n    when \"--filter\"\n      filter = argv[i+1]\n      i = i + 1\n\n    when \"-o\", \"-I\"\n      args.push(argv[i])\n      args.push(argv[i+1])\n      i = i + 1\n    when /^-/\n      args.push(argv[i])\n    else\n      if input\n        exit 1\n      else\n        input = argv[i]\n      end\n    end\n    i = i + 1\n  end\n\n  if input==nil\n    # We dont handle pipe yet:\n    exit 1\n  end\n\n  # Generate temp file\n  tempfile = Tempfile.new(\"as-filter\")\n  unless system(filter, 0 => input, 1 => tempfile)\n    status=$?.exitstatus\n    FileUtils.rm tempfile\n    exit status\n  end\n  args.push(tempfile.path)\n\n  # Call the real assembler:\n  res = system(as, *args)\n  status = if res != nil\n             $?.exitstatus\n           else\n             1\n           end\n  FileUtils.rm tempfile\n  exit status\n\nend\n\nwrapped_as(ARGV)\n
\n\n\n

This is used like this:

\n
tools/as --filter \"sed s/world/abcde/\" helloworld.s\n
\n\n\n

We now can ask the compiler to use our assembler wrapper instead of\nthe real system assembler:

\n\n
gcc -B tools/ -Wa,--filter,'sed s/world/abcde/' \\\n  helloworld.c -o helloworld-modified-gcc\n
\n\n\n
clang -no-integrated-as -B tools/ -Wa,--filter,'sed s/world/abcde/' \\\n  helloworld.c -o helloworld-modified-clang\n
\n\n\n

Which produces:

\n
\n$ ./helloworld\nHello world!\n$ ./helloworld-modified-gcc\nHello abcde!\n$ ./helloworld-modified-clang\nHello abcde!\n
\n\n

By combining the two tools, we can get a compiler with stack-cleaning enabled:

\n
gcc -B tools/  -Wa,--filter,'stack-cleaning-filter' \\\n  helloworld.c -o helloworld\n
\n\n\n

Compiler wrapper

\n

Now we can write compiler wrappers which do this job automatically:

\n
#!/bin/sh\npath=$(dirname \"$0\")\nexec gcc -B \"$path\" -Wa,--filter,\"$path\"/clean-stack-filter \"$@\"\n
\n\n\n
#!/bin/sh\npath=$(dirname \"$0\")\nexec g++ -B \"$path\" -Wa,--filter,\"$path\"/clean-stack-filter \"$@\"\n
\n\n\n
\n

Warning

\n

As the assembly modification is implemented in as,\nthis compiler wrapper will output the unmodified assembly when using\ncc -S, which can be surprising. You need to disassemble the .o file\n(with objdump) in order to see the effect of the filter.

\n
\n

Result

\n

The whole test suite of SimGrid with model-checking works with this\nimplementation. The next step is to see the impact of this\nmodification on the state comparison of SimGridMC.

"}, {"id": "http://www.gabriel.urdhr.fr/2014/09/26/adding-a-llvm-pass/", "title": "Adding a basic LLVM pass", "url": "https://www.gabriel.urdhr.fr/2014/09/26/adding-a-llvm-pass/", "date_published": "2014-09-26T00:00:00+02:00", "date_modified": "2014-09-26T00:00:00+02:00", "tags": ["computer", "simgrid", "llvm", "compilation", "assembly", "x86_64"], "content_html": "

The SimGrid model checker uses memory introspection (of the heap,\nstack and global variables) in order to detect the equality of the\nstate of a distributed application at the different nodes of its\nexecution graph. One difficulty is to deal with uninitialised\nvariables. The uninitialised global variables are usually not a big\nproblem as their initial value is 0. The heap variables are dealt with\nby memsetting to 0 the content of the buffers returned by malloc\nand friends. The case of uninitialised stack variables is more\nproblematic as their value is whatever was at this place on the stack\nbefore. In order to evaluate the impact of those uninitialised\nvariables, we would like to clean each stack frame before using\nit. This could be done with an LLVM plugin. Here's my first attempt\nto write an LLVM pass to modify the code of a function.

\n

A solution would be to insert, at compilation time,\ninstructions cleaning the stack frame at the beginning of each\nfunction. This could be implemented as an LLVM\npass:

\n\n

This is mostly relevant when the generated code is not optimised. In\noptimised code, local variables do not need to live on the stack.

\n

Table of Content

\n
\n\n
\n

LLVM overview

\n

A good high level introduction to the LLVM architecture (LLVM IR and\npasses) can be found in The Architecture of Open Source\nApplications.

\n

IR generation

\n

LLVM uses an intermediate language, LLVM\nIR to optimise and generate native\ncode.

\n

For example, a simple hello world like this,

\n
#include <stdio.h>\n\nint main(int argc, char** argv) {\n  puts(\"Hello world!\");\n  return 0;\n}\n
\n\n\n

is turned into this LLVM IR:

\n
; ModuleID = 'helloworld.c'\ntarget datalayout = \"e-m:e-i64:64-f80:128-n8:16:32:64-S128\"\ntarget triple = \"x86_64-pc-linux-gnu\"\n\n@.str = private unnamed_addr constant [13 x i8] c\"Hello world!\\00\", align 1\n\n; Function Attrs: nounwind uwtable\ndefine i32 @main(i32 %argc, i8** %argv) #0 {\n  %1 = alloca i32, align 4\n  %2 = alloca i32, align 4\n  %3 = alloca i8**, align 8\n  store i32 0, i32* %1\n  store i32 %argc, i32* %2, align 4\n  store i8** %argv, i8*** %3, align 8\n  %4 = call i32 @puts(i8* getelementptr inbounds ([13 x i8]* @.str, i32 0, i32 0))\n  ret i32 0\n}\n\ndeclare i32 @puts(i8*) #1\n\nattributes #0 = { nounwind uwtable \"less-precise-fpmad\"=\"false\" \"no-frame-pointer-elim\"=\"true\" \"no-frame-pointer-elim-non-leaf\" \"no-infs-fp-math\"=\"false\" \"no-nans-fp-math\"=\"false\" \"stack-protector-buffer-size\"=\"8\" \"unsafe-fp-math\"=\"false\" \"use-soft-float\"=\"false\" }\nattributes #1 = { \"less-precise-fpmad\"=\"false\" \"no-frame-pointer-elim\"=\"true\" \"no-frame-pointer-elim-non-leaf\" \"no-infs-fp-math\"=\"false\" \"no-nans-fp-math\"=\"false\" \"stack-protector-buffer-size\"=\"8\" \"unsafe-fp-math\"=\"false\" \"use-soft-float\"=\"false\" }\n\n!llvm.ident = !{!0}\n\n!0 = metadata !{metadata !\"Debian clang version 3.6.0-svn215195-1 (trunk) (based on LLVM 3.6.0)\"}\n
\n\n\n

by running:

\n
clang -S -emit-llvm helloworold.c -o helloworld.ll\n
\n\n\n

The generated LLVM IR can be target-dependent as the type of the\nvariables may depend on the architecture/OS:

\n\n

The initial generation of LLVM IR is not done in LLVM but by the\nfrontend (clang, dragonegg\u2026).

\n

LLVM IR passes

\n

Many LLVM optimisations are implemented in an architecture-independent\nway by IR passes which transform/optimise IR:

\n
opt -std-compile-opts -S helloworld.ll -o helloworld.opt.ll --time-passes 2> opt.log\n
\n\n\n

Generated IR:

\n
; ModuleID = 'helloworld.ll'\ntarget datalayout = \"e-m:e-i64:64-f80:128-n8:16:32:64-S128\"\ntarget triple = \"x86_64-pc-linux-gnu\"\n\n@.str = private unnamed_addr constant [13 x i8] c\"Hello world!\\00\", align 1\n\n; Function Attrs: nounwind uwtable\ndefine i32 @main(i32 %argc, i8** nocapture readnone %argv) #0 {\n  %1 = tail call i32 @puts(i8* getelementptr inbounds ([13 x i8]* @.str, i64 0, i64 0)) #2\n  ret i32 0\n}\n\n; Function Attrs: nounwind\ndeclare i32 @puts(i8* nocapture readonly) #1\n\nattributes #0 = { nounwind uwtable \"less-precise-fpmad\"=\"false\" \"no-frame-pointer-elim\"=\"true\" \"no-frame-pointer-elim-non-leaf\" \"no-infs-fp-math\"=\"false\" \"no-nans-fp-math\"=\"false\" \"stack-protector-buffer-size\"=\"8\" \"unsafe-fp-math\"=\"false\" \"use-soft-float\"=\"false\" }\nattributes #1 = { nounwind \"less-precise-fpmad\"=\"false\" \"no-frame-pointer-elim\"=\"true\" \"no-frame-pointer-elim-non-leaf\" \"no-infs-fp-math\"=\"false\" \"no-nans-fp-math\"=\"false\" \"stack-protector-buffer-size\"=\"8\" \"unsafe-fp-math\"=\"false\" \"use-soft-float\"=\"false\" }\nattributes #2 = { nounwind }\n\n!llvm.ident = !{!0}\n\n!0 = metadata !{metadata !\"Debian clang version 3.6.0-svn215195-1 (trunk) (based on LLVM 3.6.0)\"}\n
\n\n\n

CodeGen passes

\n

This optimised LLVM IR is then used to generate assembly/binary code\nfor the target architecture:

\n
llc  helloworld.opt.ll -o helloworld.s --time-passes 2> llc.log\n
\n\n\n

Generated assembly:

\n
        .text\n        .file   \"/home/foo/temp/helloworld.opt.ll\"\n        .globl  main\n        .align  16, 0x90\n        .type   main,@function\nmain:                                   # @main\n        .cfi_startproc\n# BB#0:\n        pushq   %rbp\n.Ltmp0:\n        .cfi_def_cfa_offset 16\n.Ltmp1:\n        .cfi_offset %rbp, -16\n        movq    %rsp, %rbp\n.Ltmp2:\n        .cfi_def_cfa_register %rbp\n        movl    $.L.str, %edi\n        callq   puts\n        xorl    %eax, %eax\n        popq    %rbp\n        retq\n.Ltmp3:\n        .size   main, .Ltmp3-main\n        .cfi_endproc\n\n        .type   .L.str,@object          # @.str\n        .section        .rodata.str1.1,\"aMS\",@progbits,1\n.L.str:\n        .asciz  \"Hello world!\"\n        .size   .L.str, 13\n\n\n        .ident  \"Debian clang version 3.6.0-svn215195-1 (trunk) (based on LLVM 3.6.0)\"\n        .section        \".note.GNU-stack\",\"\",@progbits\n
\n\n\n

Summary

\n

An LLVM-based compiler uses the following\nphases:

\n
  1. code analysis (preprocessing, lexing, parsing, semantic analysis…);
  2. LLVM IR generation (by the compiler);
  3. LLVM IR transformation/optimisation (by applying IR passes);
  4. native code generation from IR (by applying CodeGen passes).
\n

Steps 1 and 2 are part of the code of the compiler. Steps 3 and 4 are\nhandled by the LLVM framework (configurable/pluggable by the\ncompiler).

\n

As we want to touch the content of the stack, we want to add a CodeGen\npass.

\n

Adding a CodeGen pass

\n

Let's first try to add a pass to insert a NOP into every function.

\n

Header

\n

Let's create a new NoopInserter pass (NoopInserter.h). There are\nmany kinds of passes. This pass is a MachineFunction pass: it is\ncalled (runOnMachineFunction) on each generated native function\nand can modify it before it is passed to the next pass.

\n
#include <llvm/PassRegistry.h>\n#include <llvm/CodeGen/MachineFunctionPass.h>\n\nnamespace llvm {\n\n  class NoopInserter : public llvm::MachineFunctionPass {\n  public:\n    static char ID;\n    NoopInserter();\n    virtual bool runOnMachineFunction(llvm::MachineFunction &Fn);\n  };\n\n}\n
\n\n\n

The ID is used as a reference to the pass in LLVM: the value of this\nvariable is not important, only its address is used.

\n
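This address-as-identity idiom can be illustrated outside LLVM; a minimal hypothetical sketch (the pass names are invented, this is not LLVM code):

```cpp
#include <cassert>

// Each pass class declares a static char whose *address* (not value)
// identifies the pass uniquely across the whole program.
struct PassA { static char ID; };
struct PassB { static char ID; };
char PassA::ID = 0; // value irrelevant
char PassB::ID = 0; // same value, but a distinct address

// A registry can therefore identify a pass by pointer:
inline const void* key_of_pass_a() { return &PassA::ID; }
inline const void* key_of_pass_b() { return &PassB::ID; }
```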

Implementation

\n
#include \"NoopInserter.h\"\n\n#include <llvm/CodeGen/MachineInstrBuilder.h>\n#include <llvm/Target/TargetMachine.h>\n#include <llvm/Target/TargetInstrInfo.h>\n#include <llvm/PassManager.h>\n#include <llvm/Transforms/IPO/PassManagerBuilder.h>\n#include <llvm/CodeGen/Passes.h>\n#include <llvm/Target/TargetSubtargetInfo.h>\n#include \"llvm/Pass.h\"\n\n#define GET_INSTRINFO_ENUM\n#include \"../Target/X86/X86GenInstrInfo.inc\"\n\n#define GET_REGINFO_ENUM\n#include \"../Target/X86/X86GenRegisterInfo.inc.tmp\"\n\nnamespace llvm {\n  char NoopInserter::ID = 0;\n\n  NoopInserter::NoopInserter() : llvm::MachineFunctionPass(ID) {\n  }\n\n  bool NoopInserter::runOnMachineFunction(llvm::MachineFunction &fn) {\n    const llvm::TargetInstrInfo &TII = *fn.getSubtarget().getInstrInfo();\n    MachineBasicBlock& bb = *fn.begin();\n    llvm::BuildMI(bb, bb.begin(), llvm::DebugLoc(), TII.get(llvm::X86::NOOP));\n    return true;\n  }\n\n  char& NoopInserterID = NoopInserter::ID;\n}\n\nusing namespace llvm;\n\nINITIALIZE_PASS_BEGIN(NoopInserter, \"noop-inserter\",\n  \"Insert a NOOP\", false, false)\nINITIALIZE_PASS_DEPENDENCY(PEI)\nINITIALIZE_PASS_END(NoopInserter, \"noop-inserter\",\n  \"Insert a NOOP\", false, false)\n
\n\n\n

The runOnMachineFunction method finds the beginning of the function\nand inserts an X86 NOOP instruction. The method returns true in order\nto tell the LLVM framework that this function has been modified by\nthis pass. This implementation will only work on X86/AMD64 targets. A\nreal pass should be target-independent or at least check the target.

\n

The INITIALIZE_PASS macros declare the pass and declare its\ndependencies. Here, we are declaring a dependency on PEI, a.k.a.\nPrologEpilogInserter, which adds the prolog and epilog to the code of\neach native function. Those macros define a function:

\n
void initializeNoopInserterPass(PassRegistry &Registry);\n
\n\n\n

The NoopInserterID may be used by other passes to refer to this\npass.

\n

Declarations

\n

We have to add a few declarations for this pass.

\n

In include/llvm/CodeGen/Passes.h:

\n
// NoopInserter - This pass inserts a NOOP instruction\nextern char &NoopInserterID;\n
\n\n\n

In include/llvm/InitializePasses.h:

\n
void initializeNoopInserterPass(PassRegistry &Registry);\n
\n\n\n

Registration

\n

The pass must be added in llvm::initializeCodeGen() in\nlib/CodeGen/CodeGen.cpp:

\n
initializeNoopInserterPass(Registry);\n
\n\n\n

Result

\n
clang -O3 helloworld.c -S -o-\n
\n\n\n

We have a nice NOOP:

\n
    .text\n    .file   \"/home/foo/temp/helloworld.c\"\n    .globl  main\n    .align  16, 0x90\n    .type   main,@function\nmain:                                   # @main\n    .cfi_startproc\n# BB#0:                                 # %entry\n    nop\n    pushq   %rax\n.Ltmp0:\n    .cfi_def_cfa_offset 16\n    movl    $.L.str, %edi\n    callq   puts\n    xorl    %eax, %eax\n    popq    %rdx\n    retq\n.Ltmp1:\n    .size   main, .Ltmp1-main\n    .cfi_endproc\n\n    .type   .L.str,@object          # @.str\n    .section    .rodata.str1.1,\"aMS\",@progbits,1\n.L.str:\n    .asciz  \"Hello world!\"\n    .size   .L.str, 13\n\n\n    .ident  \"clang version 3.6.0 \"\n    .section    \".note.GNU-stack\",\"\",@progbits\n
\n\n\n

The program still works:

\n
$ clang -O3 helloworld.c\n$ ./a.out\nHello world!\n
\n\n

Conclusion

\n

I successfully managed to add a pass in order to (actively) do nothing\nin each generated native function. In the next episode, I'll try to do\nsomething useful\ninstead.

"}, {"id": "http://www.gabriel.urdhr.fr/2014/07/22/same-page-merging/", "title": "Results on same-page-merging snapshots", "url": "https://www.gabriel.urdhr.fr/2014/07/22/same-page-merging/", "date_published": "2014-07-22T00:00:00+02:00", "date_modified": "2014-07-22T00:00:00+02:00", "tags": ["simgrid", "system", "computer", "checkpoint"], "content_html": "

In the previous episode, I talked about the\nimplementation of a same-page-merging page store. On top of this, we\ncan build same-page-merging snapshots for the SimGrid model checker.

\n

Implementation

\n

SimGrid agnostic layer

\n

The next layer on top of the page store is\ngeneric logic for saving and restoring a contiguous area of memory\npages:

\n
/** @brief Take a per-page snapshot of a region\n *\n *  @param data            The start of the region (must be at the beginning of a page)\n *  @param page_count      Number of pages of the region\n *  @param pagemap         Linux kernel pagemap values for this region (or NULL)\n *  @param reference_pages Snapshot page numbers of the previous mc_softdirty_reset() (or NULL)\n *  @return                Snapshot page numbers of this new snapshot\n */\nsize_t* mc_take_page_snapshot_region(\n  void* data, size_t page_count,\n  uint64_t* pagemap, size_t* reference_pages);\n\n/** @brief Restore a snapshot of a region\n *\n *  If possible, the restoration will be incremental\n *  (the modified pages will not be touched).\n *\n *  @param start_addr      Address of the first page where we have to restore the page\n *  @param page_count      Number of pages of the region\n *  @param pagenos         Array of page indices from the global page store\n *  @param pagemap         Linux kernel pagemap values for this region (or NULL)\n *  @param reference_pagenos Snapshot page numbers of the previous mc_softdirty_reset() (or NULL)\n */\nvoid mc_restore_page_snapshot_region(\n  void* start_addr, size_t page_count,\n  size_t* pagenos,\n  uint64_t* pagemap, size_t* reference_pagenos);\n\n/** @brief Free the pages of a snapshot region\n */\nvoid mc_free_page_snapshot_region(\n  size_t* pagenos, size_t page_count);\n\n/** @brief Reset the soft-dirty bits\n *\n *  This is done after checkpointing and after checkpoint restoration\n *  (if per-page checkpointing is used) in order to know which pages were\n *  modified.\n *\n *  See https://www.kernel.org/doc/Documentation/vm/soft-dirty.txt\n */\nvoid mc_softdirty_reset();\n
\n\n\n

SimGrid snapshot layer

\n

The next layer is SimGrid-specific and handles part of the\nsnapshoting logic:

\n\n

State comparison layer

\n

The most invasive part of this modification in the SimGrid codebase is\nthe logic to read data from the snapshots. Without this feature, a\nsimple offset was applied to find the base of a variable in the\nsnapshot: now, a software MMU algorithm must be applied. A variable can\nnow be split across different non-contiguous memory pages. The whole\nlogic of reading from snapshots had to be modified to handle this.

\n

Results

\n

Those results were obtained with the command:

\n
# COMMAND: sendrecv2, mprobe or sendall\n# SPARSE, SOFTDIRTY: yes or no\ncd teshsuite/smpi/mpich3-test/pt2pt/\nexport TIME=\"clock:%e user:%U sys:%S swapped:%W exitval:%x max:%Mk\"\nsetarch x86_64 -R time smpirun -hostfile ../hostfile -platform $(find ../../../.. -name small_platform_with_routers.xml) --cfg=maxmin/precision:1e-9 --cfg=network/model:SMPI --cfg=network/TCP_gamma:4194304 -np 4 --cfg=model-check:1 --cfg=smpi/send_is_detached_thres:0 --cfg=smpi/coll_selector:mpich --cfg=contexts/factory:ucontext --cfg=model-check/max_depth:100000 --cfg=model-check/reduction:none --cfg=model-check/visited:100000 --cfg=contexts/stack_size:4 --cfg=model-check/sparse-checkpoint:$SPARSE --cfg=model-check/soft-dirty:$SOFTDIRTY $COMMAND\n
\n\n\n

They were run on a laptop with quad-core Intel\u00ae Core\u2122 i7-3687U\nCPU @ 2.10GHz with 8GiB of RAM. Note that the memory reported is the\nRSS and does not include swapped-out memory.

\n

sendrecv2

\n

In this example, we observe an 80% reduction of the memory consumption\nfor a slight slowdown. Using soft-dirty tracking does not have a\npositive impact on the performance: some time is gained in user land\nby avoiding comparing memory pages but the same amount of time is\nspent in kernel space tracking the soft-clean/soft-dirty pages.

\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
Type | clock | user | system | Max. RSS (KiB)
Simple snapshot | 9.96s | 9.16s | 0.78s | 3 332 788
Same-page-merging snapshot w/o soft-dirty tracking | 10.02s | 9.82s | 0.19s | 540 420
Same-page-merging snapshot with soft-dirty tracking | 10.70s | 8.86s | 1.80s | 540 936
\n

mprobe

\n

Similar results here:

\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
Type | clock | user | system | Max. RSS (KiB)
Simple snapshot | 13.41s | 13.00s | 0.40s | 1 692 492
Same-page-merging snapshot w/o soft-dirty tracking | 14.12s | 13.89s | 0.14s | 414 916
Same-page-merging snapshot with soft-dirty tracking | 14.44s | 13.16s | 1.25s | 415 028
\n

sendflood

\n

In this example, without the same-page-merging snapshot we hit the\nswap limit (the RSS does not include the swapped-out memory). In this\ncase, using same-page-merging snapshots is faster because the process\ndoes not swap. Using soft-dirty tracking does not have a beneficial\nimpact in this case either: a lot of time is lost marking the pages\nas soft-dirty/soft-clean.

\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
Type | clock | user | system | Max. RSS (KiB)
Simple snapshot | 73.31s | 56.34s | 5.26s | 7 213 956
Same-page-merging snapshot w/o soft-dirty tracking | 59.12s | 56.87s | 2.22s | 1 570 312
Same-page-merging snapshot with soft-dirty tracking | 82.74s | 53.71s | 29.06s | 1 609 048
\n

Conclusion

\n

This approach achieves an important reduction of the memory\nconsumption without a significant impact on performance. With this\ntechnique we should be able to handle bigger applications and\nsave more states of the application. Those tests were run on\napplications where a lot of pages change between snapshots. On\napplications where many pages are not modified, the reduction of\nmemory consumption should be much greater.

\n

Soft-dirty tracking does not seem to be very efficient in our\ntests. It might be useful if the application is swapping, by avoiding\nswapping when taking a snapshot. This feature will probably be disabled\nby default and might be removed in the future.

\n

It should be possible to increase the efficiency of the method by\nincreasing page sharing:

\n\n

It should be possible to speed up the process by:

\n\n

We used the granularity of the memory page but it is not strictly\nnecessary. We might use a finer granularity in order to increase the\nsharing between snapshots. The granularity (the size of the chunks)\nshould be regular and a power of 2 (in order to be able to apply the\nMMU algorithm). However, the memory overhead would be greater (more\npage chunk indices would have to be stored in the index for each snapshot).

"}, {"id": "http://www.gabriel.urdhr.fr/2014/06/13/page-store/", "title": "Page store for the Simgrid model checker", "url": "https://www.gabriel.urdhr.fr/2014/06/13/page-store/", "date_published": "2014-06-13T00:00:00+02:00", "date_modified": "2014-06-13T00:00:00+02:00", "tags": ["simgrid", "system", "computer", "checkpoint"], "content_html": "

The first (lower) layer of the per-page snapshot mechanism is a page\nstore: its responsibility is to store immutable, shareable,\nreference-counted memory pages independently of the snapshoting\nlogic. Snapshot management and representation and soft-dirty tracking\nwill be handled in a higher layer.

\n

Data structure

\n
class s_mc_pages_store {\n\n  typedef uint64_t hash_type;\n  typedef boost::unordered_set<size_t> page_set_type;\n  typedef boost::unordered_map<hash_type, page_set_type> pages_map_type;\n\n  void* memory_;\n  size_t capacity_;\n  size_t top_index_;\n  std::vector<uint64_t> page_counts_;\n  std::vector<size_t> free_pages_;\n  pages_map_type hash_index_;\n\n  // [... Methods]\n\n};\n
\n\n\n

In this initial version, the structure of the page store is made of:

\n\n

We want to keep this memory region (*memory_) aligned on the memory pages (so\nthat we might be able to create non-linear memory mappings on those\npages in the future) and be able to expand it without copying the\ndata (there will be a lot of pages here): we will be able to\nefficiently expand the memory mapping using mremap(), moving it\nto another virtual address if necessary.

\n
void* new_memory = mremap(this->memory_, this->capacity_ << xbt_pagebits, newsize << xbt_pagebits, MREMAP_MAYMOVE);\nif (new_memory == MAP_FAILED) {\n xbt_die(\"Could not mremap snapshot pages.\");\n}\nthis->capacity_ = newsize;\nthis->memory_ = new_memory;\nthis->page_counts_.resize(newsize, 0);\n
\n\n\n

Because we will move this memory mapping in the virtual address\nspace, we only need to store the index of the page in the snapshots;\nthe page will always be looked up by going through memory_:

\n
const void* s_mc_pages_store::get_page(size_t pageno) const {\n  return (char*) this->memory_ + (pageno << pagebits);\n}\n
\n\n\n

API

\n
class s_mc_pages_store {\n  // [...]\n\npublic: // Ctor and dtor\n  explicit s_mc_pages_store(size_t size);\n  ~s_mc_pages_store();\n\npublic: // API\n\n  void unref_page(size_t pageno);\n  void ref_page(size_t pageno);\n  size_t store_page(void* page);\n  const void* get_page(size_t pageno) const;\n\nprivate:\n  size_t alloc_page();\n\n};\n
\n\n\n

get_page()

\n

get_page() returns a pointer to the memory of a page from its index.

\n
const void* s_mc_pages_store::get_page(size_t pageno) const {\n  return (char*) this->memory_ + (pageno << pagebits);\n}\n
\n\n\n

store_page()

\n

store_page() is used to store a page in the page store and return\nthe index of the stored page.

\n
size_t s_mc_pages_store::store_page(void* page)\n{\n
\n\n\n

First, we check if a page with the same content is already in the page\nstore:

\n
    \n
  1. compute the hash of the page;
  2. find pages with the same hash using hash_index_;
  3. memcmp() those pages with the one we are inserting to find a page with the same content.
\n
  uint64_t hash = mc_hash_page(page);\n  page_set_type& page_set = this->hash_index_[hash];\n  BOOST_FOREACH (size_t pageno, page_set) {\n    const void* snapshot_page = this->get_page(pageno);\n    if (memcmp(page, snapshot_page, xbt_pagesize) == 0) {\n
\n\n\n

If a page with the same content is already in the page store it is\nreused and its reference count is incremented.

\n
      page_counts_[pageno]++;\n      return pageno;\n    }\n  }\n
\n\n\n

Otherwise, a new page is allocated in the page store and the content\nof the page is memcpy()-ed to this new page.

\n
  size_t pageno = this->alloc_page();\n  void* snapshot_page = (void*) this->get_page(pageno);\n  memcpy(snapshot_page, page, xbt_pagesize);\n  page_set.insert(pageno);\n  page_counts_[pageno]++;\n  return pageno;\n}\n
\n\n\n

ref_page()

\n

This method is used to increase the reference count of a page when we\nknow that the content of the page is the same as a page already in the page\nstore.

\n

This will be the case if a page is soft-clean: we know that it has not\nchanged since the previous snapshot/restoration and we can avoid\nhashing the page and comparing it byte-per-byte to candidates.

\n
void s_mc_pages_store::ref_page(size_t pageno) {\n  ++this->page_counts_[pageno];\n}\n
\n\n\n

unref_page()

\n

Decrement the reference count of this page. Used when a snapshot is\ndestroyed.

\n

If the reference count reaches zero, the page is recycled: it is added\nto the free_pages_ list and removed from the hash_index_. In the\ncurrent implementation, we need to hash the page in order to find it\nin the index.

\n
void s_mc_pages_store::unref_page(size_t pageno) {\n  if ((--this->page_counts_[pageno]) == 0) {\n    this->free_pages_.push_back(pageno);\n    void* page = ((char*)this->memory_ + (pageno << pagebits));\n    uint64_t hash = mc_hash_page(page);\n    this->hash_index_[hash].erase(pageno);\n  }\n}\n
\n\n\n

Tweaks and improvements

\n

Which hashing algorithm?

\n

Currently the code is using djb2\nbut other hashes such as\nMurmur or\nCityHash are probably\nbetter.

\n
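For reference, djb2 is only a few lines. A sketch of a page hash based on it (the actual mc_hash_page implementation may differ):

```cpp
#include <cstddef>
#include <cstdint>

// djb2: hash = hash * 33 + byte, seeded with 5381.
// Sketch of a page-content hash; the real mc_hash_page may differ.
std::uint64_t djb2(const unsigned char* data, std::size_t size)
{
  std::uint64_t hash = 5381;
  for (std::size_t i = 0; i != size; ++i)
    hash = ((hash << 5) + hash) + data[i]; // hash * 33 + data[i]
  return hash;
}
```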

Shared memory page store

\n

It is very easy to use a file (shared memory, FS file, block-device)\ninstead of anonymous memory for the page store: could we use this to\nparallelise the model checker on different processes or even machines?

\n

References

\n"}, {"id": "http://www.gabriel.urdhr.fr/2014/06/03/non-cow-snapshots/", "title": "Per-page shallow snapshots for the SimGrid model checker", "url": "https://www.gabriel.urdhr.fr/2014/06/03/non-cow-snapshots/", "date_published": "2014-06-03T00:00:00+02:00", "date_modified": "2014-06-03T00:00:00+02:00", "tags": ["simgrid", "system", "computer", "checkpoint"], "content_html": "

I looked at my options to achieve efficient/cheap snapshots of the\nsimulated application for the Simgrid model checker using\ncopy-on-write. Here I look at another\nsolution to achieve this without using copy-on-write.

\n

Checkpointing

\n

Basic idea

\n

The idea is to save each page of the state of the application\nindependently: when a snapshot page is stored, the snapshoting logic\nfirst checks if a page with the same content is already stored\nin the snapshot pages:

\n\n

The memory pages are only shared between the different snapshots but\nare never shared with the simulated application: copy-on-write is not\nused, which means that the simulated application will not be slowed\ndown by the unsharing page faults. As a result, the basic solution can\nbe implemented purely in userspace.

\n

The first snapshot will be a full snapshot. Other snapshots will\nusually be shallow: if 98% of the memory pages are not touched between\nsuccessive snapshots, all those pages will be shared and only 2%\nof the pages will be copied in the second snapshot.

\n

A hash of the content of the page can be used to limit the\ncomparison of the new memory page with only a subset of the stored\nmemory pages.

\n
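The hash-then-memcmp deduplication can be sketched as follows (a simplified illustration; the names and layout are hypothetical, not SimGrid's actual code):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <unordered_map>
#include <vector>

constexpr std::size_t kPageSize = 4096;

struct PageStore {
  std::vector<std::array<unsigned char, kPageSize>> pages;
  std::unordered_map<std::uint64_t, std::vector<std::size_t>> index;

  static std::uint64_t hash(const unsigned char* p) {
    std::uint64_t h = 5381; // djb2
    for (std::size_t i = 0; i != kPageSize; ++i) h = h * 33 + p[i];
    return h;
  }

  // Store a page, sharing it with an identical stored page if one exists.
  std::size_t store(const unsigned char* page) {
    std::uint64_t h = hash(page);
    for (std::size_t pageno : index[h]) // only candidates with the same hash
      if (std::memcmp(pages[pageno].data(), page, kPageSize) == 0)
        return pageno; // identical content: share it
    pages.emplace_back(); // new content: copy it
    std::memcpy(pages.back().data(), page, kPageSize);
    index[h].push_back(pages.size() - 1);
    return pages.size() - 1;
  }
};
```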

Better snapshots with soft-dirty page tracking

\n

It is still necessary to scan and hash all the pages of the\nstate of the simulated process each time a snapshot is done\nwhich seems to be quite inefficient.\nWe can use the\nsoft-dirty\nfeature of the Linux kernel to detect which pages have been written\nsince the previous snapshot and only try to store the modified\nones.

\n

After each snapshot, each page of the process is marked as soft-clean\nand protected against write. Each time a soft-clean page is touched, a\npage fault is raised: the kernel marks the page as soft-dirty and\nremove the protection on the page. On the next snapshot, it is\npossible to find which pages are soft-dirty (i.e. were modified since\nthe previous snapshot) and only save those pages.

\n
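Concretely, on Linux the soft-dirty state of each page is exposed as bit 55 of the corresponding 64-bit entry of /proc/&lt;pid&gt;/pagemap, and the tracking is reset by writing 4 to /proc/&lt;pid&gt;/clear_refs. The bit manipulation can be sketched as:

```cpp
#include <cstdint>

// Each entry of /proc/<pid>/pagemap is a 64-bit word describing one
// virtual page; bit 55 is the soft-dirty flag
// (see https://www.kernel.org/doc/Documentation/vm/soft-dirty.txt).
constexpr std::uint64_t kSoftDirty = std::uint64_t(1) << 55;

bool is_soft_dirty(std::uint64_t pagemap_entry)
{
  return (pagemap_entry & kSoftDirty) != 0;
}

// Resetting the tracking (marking every page soft-clean again) is done
// by writing "4" to /proc/<pid>/clear_refs.
```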

Incremental snapshot restoration

\n

Even when restoring the state of a snapshot, we might use the\nsoft-dirty information to avoid copying data which have not changed:

\n\n

If a lot of pages do not change between snapshots, this technique\nreduces the number of pages which needs to be copied to restore\na snapshot (and avoid the related soft-dirty page faults).

\n

Memory address translation

\n

Once an efficient snapshoting strategy is implemented, I expect that\nin many cases, most of the time will be spent in the state comparison\ncode: we need to find a solution to avoid spending too much time\ntranslating between the addresses of the simulated application and the\naddresses of the snapshots.

\n

Create a linear view of the areas of the snapshots

\n

We might want to create a linear view of the areas of each\nsnapshot in order to have simple code for the address\ntranslation:

\n
find_snapshot_address(real_address, snapshot)\n{\n  memory_area             \u2190 find_memory_area(real_address)\n  offset                  \u2190 real_address - memory_area.start\n  snapshot_area           \u2190 find_snapshot_area(memory_area)\n  snapshot_address        \u2190 snapshot_area.start + offset\n  return snapshot_address\n}\n
\n\n\n
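The pseudocode above can be transcribed into runnable C++ (the SnapshotArea table and its field names are a hypothetical representation):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One saved memory area: the real address range it mirrors and the
// base of its linear copy in the snapshot (hypothetical representation).
struct SnapshotArea {
  std::uintptr_t real_start;     // base of the area in the application
  std::size_t size;              // size in bytes
  std::uintptr_t snapshot_start; // base of its linear copy in the snapshot
};

// With a linear view per area, translation is a single offset computation.
std::uintptr_t find_snapshot_address(std::uintptr_t real_address,
                                     const std::vector<SnapshotArea>& areas)
{
  for (const SnapshotArea& area : areas)
    if (real_address >= area.real_start &&
        real_address < area.real_start + area.size)
      return area.snapshot_start + (real_address - area.real_start);
  return 0; // not saved in this snapshot
}
```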

Moreover, and probably more importantly, as long as we stay in the same\nmemory region, an offset applied to the real address translates\ninto the same offset in the linear view of the snapshot. This case\nhappens all the time when we are comparing the states:

\n\n

In all those cases, we could apply a simple offset from the base\nsnapshot address: if the memory\npages of the snapshot are scattered in the virtual memory space, the\nmodel checker will have to\napply the offset to the real base address\nand then translate the resulting address.

\n

\u2026 using non-linear memory mappings

\n

One solution to create a linear view of a snapshot memory region,\nwould be to use a non-linear memory mapping (remap_file_pages) of\nthe snapshot memory:

\n\n

However, one remap_file_pages() call will be necessary per\nmemory page, so I do not expect this solution to be very promising\nunless a more efficient version of this system call is added in a\nlater release of the Linux kernel.

\n

Update: remap_file_pages is\ndeprecated.

\n

\u2026 using incremental snapshot reconstruction

\n

Another solution is to create a linear copy of the snapshot areas.\nWe incrementally update those copies to reflect different snapshots,\nonly updating the pages which are different between the different\nsnapshots.

\n

When we want to compare the current state against another one, we\nfirst have to recreate a linear view of the snapshot of the latter by\ncopying in the linear view all the memory pages which are different\nfrom the previous view.

\n

We want to avoid reconstructing the state memory when it is not\nnecessary. This can be done by\ncreating a global hash of the state of the simulated application\nbased on key characteristics of the state (such\nas the number of processes, the instruction pointers of each process\nin its stack frame\u2026).

\n

Software MMU

\n

The other solution is to replicate the algorithm of the MMU in\nsoftware to translate from virtual pages into file pages:

\n
find_snapshot_address(real_address, snapshot)\n{\n  page_number             \u2190 get_page_number(address)\n  offset                  \u2190 get_offset(address)\n  snapshot_page_number    \u2190 get_snapshot_page_number(snapshot, page_number)\n  snapshot_page_address   \u2190 get_page_address(snapshot_page_number)\n  snapshot_address        \u2190 snapshot_page_address + offset\n  return snapshot_address\n}\n
\n\n\n
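The software MMU above can be transcribed into runnable C++ for 4 KiB pages, i.e. 12 offset bits (the Snapshot representation is hypothetical):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr unsigned kPageBits = 12; // 4 KiB pages
constexpr std::uintptr_t kOffsetMask = (std::uintptr_t(1) << kPageBits) - 1;

struct Snapshot {
  std::uintptr_t store_base;           // base address of the page store
  std::vector<std::size_t> page_table; // virtual page number -> store page number
};

std::uintptr_t find_snapshot_address(std::uintptr_t real_address,
                                     const Snapshot& snapshot)
{
  std::size_t page_number = real_address >> kPageBits; // like the MMU
  std::uintptr_t offset = real_address & kOffsetMask;
  std::size_t store_pageno = snapshot.page_table[page_number];
  return snapshot.store_base +
         (std::uintptr_t(store_pageno) << kPageBits) + offset;
}
```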

As I said earlier, this might impact the performance of the state\ncomparison.

\n

Other granularity

\n

We might use another granularity instead of the page:\nfor example we might snapshot at the malloc()\ngranularity:

\n\n

Conclusion

\n

This approach seems quite promising:

\n\n

It is not clear which variation will be the more efficient. I'm\nprobably going to implement the software MMU approach.

"}, {"id": "http://www.gabriel.urdhr.fr/2014/06/02/cow-snapshots/", "title": "Copy-on-write snapshots for the SimGrid model checker", "url": "https://www.gabriel.urdhr.fr/2014/06/02/cow-snapshots/", "date_published": "2014-06-02T00:00:00+02:00", "date_modified": "2014-06-02T00:00:00+02:00", "tags": ["simgrid", "system", "computer", "checkpoint"], "content_html": "

The SimGrid model checker\nexplores the graph of possible executions of\na simulated distributed application in order to verify safety and\nliveness properties. The model checker needs to store the state of the\napplication in each node of the execution graph in order to detect\ncycles. However, saving the whole state of the application at each\nnode of the graph leads to huge memory consumption and in some\ncases most of the time is spent copying data in order to take the\nsnapshots of the application. We will see how we could solve this problem,\nusing copy-on-write.

\n

Current state

\n

SimGrid simulates a distributed application on a single\nmachine in a single OS process: this allows very efficient\ntask switching as it can be done completely in\nuserspace. All simulated processes use a shared heap and\neach one uses its own stack which is allocated on this shared heap.

\n

The model checker lives in the same OS process and uses a separate\nheap. Each time it needs to take a snapshot of the application, the\nmodel checker makes a copy (using memcpy()) of each memory area which\nis considered to contain a part of the state of the application:

\n\n
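This copying scheme boils down to the following sketch (a simplified illustration; the struct and function names are hypothetical, not SimGrid's actual code):

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// One saved area: a full private copy of the region's bytes.
struct SavedRegion {
  void* start;                     // address in the simulated application
  std::size_t size;
  std::vector<unsigned char> data; // full private copy
};

// Snapshot: memcpy() every region of interest into the model checker's heap.
SavedRegion save_region(void* start, std::size_t size)
{
  SavedRegion region{start, size, std::vector<unsigned char>(size)};
  std::memcpy(region.data.data(), start, size);
  return region;
}

// Backtracking restores the state by copying everything back.
void restore_region(const SavedRegion& region)
{
  std::memcpy(region.start, region.data.data(), region.size);
}
```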

Saving a lot of snapshots of the application can use a lot of memory. Some of\nthe applications we are trying to model-check use the whole 256 GiB of\nRAM of the machines we are using. Moreover in some applications, most\nof the time is spent copying the data (the diagram is made with\nFlameGraph):

\n
\n \n \n \n
\n Model checker on the sp.S.4 benchmark: \n 83% of the time is spent in memcpy()\n taking snapshots of the application\n
\n
\n\n
smpirun -wrapper \"perf record -g -e cycles\" -hostfile hostfile -platform msg_platform.xml -np 4 --cfg=model-check:1 --cfg=model-check/reduction:none --cfg=model-check/communications_determinism:1 --cfg=smpi/send_is_detached_thres:0 --cfg=model-check/max_depth:100000 --cfg=smpi/running_power:1e9 --cfg=contexts/factory:ucontext --cfg=model-check/visited:100 ./sp.S.4\nperf script | ~/src/FlameGraph/stackcollapse-perf.pl | grep -v '^\\[unknown\\];' | ~/src/FlameGraph/flamegraph.pl > sp.S.4.svg\n
\n\n\n

In practice, in many applications, only a small part of the memory of the\napplication has changed between successive states. In order to\nevaluate this, I modified the model-checker to use the Linux\nsoft-dirty\nmechanism:

\n
    \n
  1. after each snapshot, each memory page of the application is marked as soft-clean;
  2. before doing the snapshot, every page which is still soft-clean has not been modified by the application since the previous snapshot and could be shared with the previous snapshot.
\n

On the previous benchmark\n(sp.S.4 from the NAS Parallel Benchmarks Version 3.3),\n99% of the memory\npages of the state of the application\nwere not touched between successive snapshots: at least 99% of\nthe memory could be shared between successive snapshots \nwhen analysing this application.

\n

Based on this observation, we would like to find a smarter way\nto take snapshots of the application with the following goals in mind:

\n
    \n
  1. share memory between the common parts of the snapshots;
  2. avoid copying data as much as possible;
  3. being able to share memory even after state restoration;
  4. make an efficient restoration of the state when the model-checker needs to backtrack in the graph of executions.
\n

KSM

\n

The KSM (Kernel Samepage Merging)\nmechanism of the Linux kernel can be\nused to enable automatic page sharing between snapshots:\nthe kernel finds memory pages with the same content, merges them\nand uses copy-on-write to unshare them if one of the virtual pages is\nmodified later on.

\n

In order to do this, the application must mark each memory region\nwhere it wants the kernel to detect mergeable pages:

\n
madvise(start, length, MADV_MERGEABLE);\n
\n\n\n

KSM must be enabled system-wide (as root) with:

\n
# Enable KSM:\necho 1 > /sys/kernel/mm/ksm/run\n# Scan more pages:\necho 10000 > /sys/kernel/mm/ksm/pages_to_scan\n
\n\n\n

This solution is quite nonintrusive and has been implemented.

\n

However, it does not address our second goal (avoid copying the data):\nthe page must be completely copied first and only then will the KSM kernel\nprocess scan it and merge it back. Moreover, scanning the pages in order to\nfind duplicates is quite CPU intensive. As a result (this part needs\nto be verified), the memory pages are deduplicated slower than they\nare allocated, which means that the memory reduction is very limited in\npractice.

\n

This leads to the idea of doing explicit copy-on-write instead.

\n

Copy-on-write implementations

\n

Copy-on-write is used on most POSIXish systems by the fork()\nfunction. In the case of a single-threaded application, a forked\nprocess could be seen as a snapshot of the simulated application.\nHowever, the snapshot memory does not live in the same virtual address\nspace and is not easily available to the model checker without copying\nit back into the main process.

\n

Using mprotect()

\n

A possible solution to implement copy-on-write would be to implement\nit in userspace using mprotect()\nand remap_file_pages():

\n\n

A memory-backed (tmpfs) file is used as an intermediate level between logical\npages and physical pages:\nphysical memory \u2192 file memory \u2192 virtual memory or swap.\nThe remap_file_pages() Linux system call can be used to create a\nnon-linear mapping between physical pages and file pages.

\n

However, this does not seem a suitable solution:

\n\n

Update: remap_file_pages is\ndeprecated.

\n

Native copy-on-write

\n

Some operating systems expose an in-process copy-on-write\nfunctionality. Some Mach-based systems expose it using the\nvm_remap()\nMach call.\nHowever, the only 64-bit OS supporting this seems to be\nXNU/Darwin/MacOS X/iOS:\nporting the model checker to XNU systems would\ntake a lot of time (and it seems Darwin without MacOS X is quite dead\nanyway). It sounded like a good excuse to try the Hurd,\nwhich is based on a Mach kernel,\nbut I discovered that it does not work with 64-bit systems.

\n

Linux does not expose an in-process copy-on-write functionality. I\ncould try to add this feature to the Linux kernel: the copy-on-write\nlogic would not be touched; the only missing bit is code to set up the\ncopy-on-write regions properly inside the same process and an interface\n(syscall option\u2026) to trigger it from userspace.\nI'm not sure our chances of merging this feature would be very high,\nbut this might be a solution worth exploring in the future.

\n

A native copy-on-write solution should address all of our goals. Page\nfaults with page deduplication will slow the application down: in\npractice if a small number of pages are modified between different\nsnapshots this should not be a big issue and I expect that it would\nstill be a big win compared to the current implementation.

\n

Page-level snapshots

\n

In the next episode, we will have a\nlook at non copy-on-write solutions based on userspace-managed page-level\nsnapshots.

\n

References

\n"}, {"id": "http://www.gabriel.urdhr.fr/2014/05/23/flamegraph/", "title": "Profiling and optimising with Flamegraph", "url": "https://www.gabriel.urdhr.fr/2014/05/23/flamegraph/", "date_published": "2014-05-23T00:00:00+02:00", "date_modified": "2014-05-23T00:00:00+02:00", "tags": ["simgrid", "optimisation", "profiling", "computer", "flamegraph", "unix", "gdb", "perf"], "content_html": "

Flamegraph\nis a software which generates SVG graphics\nto visualise stack-sampling based\nprofiles. It processes data collected with tools such as Linux perf,\nSystemTap, DTrace.

\n

For the impatient:

\n\n

Table of Content

\n
\n\n
\n

Profiling by sampling the stack

\n

The idea is that in order to know where your application is using CPU\ntime, you should sample its stack. You can get one sample of the\nstack(s) of a process with GDB:

\n
# Sample the stack of the main (first) thread of a process:\ngdb -ex \"set pagination 0\" -ex \"bt\" -batch -p $(pidof okular)\n\n# Sample the stack of all threads of the process:\ngdb -ex \"set pagination 0\" -ex \"thread apply all bt\" -batch -p $(pidof okular)\n
\n\n\n

This generates backtraces such as:

\n
[...]\nThread 2 (Thread 0x7f4d7bd56700 (LWP 15156)):\n#0  0x00007f4d9678b90d in poll () from /lib/x86_64-linux-gnu/libc.so.6\n#1  0x00007f4d93374fe4 in g_main_context_poll (priority=2147483647, n_fds=2, fds=0x7f4d70002e70, timeout=-1, context=0x7f4d700009a0) at /tmp/buildd/glib2.0-2.40.0/./glib/gmain.c:4028\n#2  g_main_context_iterate (context=context@entry=0x7f4d700009a0, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at /tmp/buildd/glib2.0-2.40.0/./glib/gmain.c:3729\n#3  0x00007f4d933750ec in g_main_context_iteration (context=0x7f4d700009a0, may_block=1) at /tmp/buildd/glib2.0-2.40.0/./glib/gmain.c:3795\n#4  0x00007f4d9718b676 in QEventDispatcherGlib::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4\n#5  0x00007f4d9715cfef in QEventLoop::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4\n#6  0x00007f4d9715d2e5 in QEventLoop::exec(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4\n#7  0x00007f4d97059bef in QThread::exec() () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4\n#8  0x00007f4d9713e763 in ?? () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4\n#9  0x00007f4d9705c2bf in ?? 
() from /usr/lib/x86_64-linux-gnu/libQtCore.so.4\n#10 0x00007f4d93855062 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0\n#11 0x00007f4d96796c1d in clone () from /lib/x86_64-linux-gnu/libc.so.6\n\nThread 1 (Thread 0x7f4d997ab780 (LWP 15150)):\n#0  0x00007f4d9678b90d in poll () from /lib/x86_64-linux-gnu/libc.so.6\n#1  0x00007f4d93374fe4 in g_main_context_poll (priority=2147483647, n_fds=8, fds=0x2f8a940, timeout=1998, context=0x1c747e0) at /tmp/buildd/glib2.0-2.40.0/./glib/gmain.c:4028\n#2  g_main_context_iterate (context=context@entry=0x1c747e0, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at /tmp/buildd/glib2.0-2.40.0/./glib/gmain.c:3729\n#3  0x00007f4d933750ec in g_main_context_iteration (context=0x1c747e0, may_block=1) at /tmp/buildd/glib2.0-2.40.0/./glib/gmain.c:3795\n#4  0x00007f4d9718b655 in QEventDispatcherGlib::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4\n#5  0x00007f4d97c017c6 in ?? () from /usr/lib/x86_64-linux-gnu/libQtGui.so.4\n#6  0x00007f4d9715cfef in QEventLoop::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4\n#7  0x00007f4d9715d2e5 in QEventLoop::exec(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4\n#8  0x00007f4d97162ab9 in QCoreApplication::exec() () from /usr/lib/x86_64-linux-gnu/libQtCore.so.4\n#9  0x00000000004082d6 in ?? ()\n#10 0x00007f4d966d2b45 in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6\n#11 0x0000000000409181 in _start ()\n[...]\n
\n\n\n

By doing this a few times, you should be able to get an idea of\nwhat's taking time in your process (or thread).

\n

Using FlameGraph for visualising stack samples

\n

Taking a few random stack samples of the process might be fine and\nhelp you in some cases but, in order to get more accurate information,\nyou might want to take a lot of stack samples. FlameGraph can help you\nvisualise those stack samples.

\n

How does FlameGraph work?

\n

FlameGraph reads, from its standard input, stack\nsamples in a simple format where each line represents a call stack\nand its number of samples:

\n
main;init;init_boson_processor;malloc  2\nmain;init;init_logging;malloc          4\nmain;processing;compute_value          8\nmain;cleanup;free                      3\n
\n\n\n
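If a tool emits one pre-joined stack per sample, the counting step that produces this folded format is a one-liner. A minimal sketch (fold_counts is a hypothetical helper name, not part of FlameGraph):

```shell
# fold_counts: turn one pre-joined stack per input line
# ("main;init;malloc") into the folded "stack count" lines shown above.
# Hypothetical helper, not part of FlameGraph itself.
fold_counts() {
    sort | uniq -c | awk '{ print $2, $1 }'
}

# Example:
#   printf 'main;a\nmain;a\nmain;b\n' | fold_counts
```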

FlameGraph generates a corresponding SVG representation:

\n
\n\n \"[corresponding\n\n
Corresponding FlameGraph output
\n
\n\n

FlameGraph ships with a set of preprocessing scripts\n(stackcollapse-*.pl) used to convert data from various\nperformance/profiling tools into this simple format\nwhich means you can use FlameGraph with perf, DTrace,\nSystemTap or your own tool:

\n
your_tool | flamegraph_preprocessor_for_your_tool | flamegraph > result.svg\n
\n\n\n

It is very easy to add support for a new tool in a few lines of\nscript. I wrote a\npreprocessor\nfor the GDB backtrace output (produced by the previous poor man's\nprofiler script) which is now available\nin the main repository.

\n
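The core of such a preprocessor fits in a short awk script. The sketch below (collapse_gdb is a hypothetical name) only handles the common "#N 0xADDR in func (...)" frame lines and is a simplification of what the real stackcollapse-gdb.pl does:

```shell
# collapse_gdb: read GDB backtrace output on stdin and write
# FlameGraph's folded format (outermost;...;innermost count) on stdout.
# Simplified sketch: only "#N ... in func (...)" frame lines are handled.
collapse_gdb() {
    awk '
    function flush(   s, i) {
        if (n == 0) return
        s = frames[n-1]                    # outermost frame first
        for (i = n - 2; i >= 0; i--) s = s ";" frames[i]
        counts[s]++
        n = 0
    }
    /^#[0-9]+/ {
        fn = $2                            # frame line without "in" keyword
        for (i = 3; i <= NF; i++) if ($(i-1) == "in") fn = $i
        frames[n++] = fn
        next
    }
    { flush() }                            # blank or "Thread" line ends a stack
    END { flush(); for (s in counts) print s, counts[s] }
    '
}
```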

As FlameGraph uses a tool-neutral line-oriented format, it is very\neasy to add generic filters after the preprocessor (using sed,\ngrep\u2026):

\n
the_tool | flamegraph_preprocessor_for_the_tool | filters | flamegraph > result.svg\n
\n\n\n

Update 2015-08-22:\nElfutils ships a stack program\n(called eu-stack on Debian) which seems to be much faster than GDB\nfor using as a Poor man's Profiler in a shell script. I wrote a\nscript in order to feed its output to\nFlameGraph.

\n

Using FlameGraph with perf

\n

perf is a very powerful tool for Linux to do performance analysis of\nprograms. For example, here's how we can generate an\non-CPU\nFlameGraph of an application using perf:

\n
# Use perf to do a time based sampling of an application (on-CPU):\nperf record -F99 --call-graph dwarf myapp\n\n# Turn the data into a cute SVG:\nperf script | stackcollapse-perf.pl | flamegraph.pl > myapp.svg\n
\n\n\n

This samples the on-CPU time, excluding time when the process is not\nscheduled (idle, waiting on a semaphore\u2026), which may not be what you\nwant. It is possible to sample\noff-CPU\ntime as well with\nperf.

\n

The simple and fast solution1 is to use the frame pointer\nto unwind the stack frames (--call-graph fp). However, the frame pointer\ntends to be omitted these days (it is not mandated by the x86_64 ABI):\nthis might not work very well unless you recompile your code and dependencies\nwithout omitting the frame pointer (-fno-omit-frame-pointer).

\n

Another solution is to use CFI to unwind the stack (with --call-graph\ndwarf): this uses either the DWARF CFI (.debug_frame section) or\nruntime stack unwinding (.eh_frame section). The CFI must be present\nin the application and shared-objects (with\n-fasynchronous-unwind-tables or -g). On x86_64, .eh_frame should\nbe enabled by default.

\n

Update 2015-09-19: Another solution on recent Intel chips (and\nrecent kernels) is to use the hardware LBR\nregisters (with --call-graph\nlbr).

\n

Transforming and filtering the data

\n

As FlameGraph uses a simple line-oriented format, it is very easy to\nfilter/transform the data by placing a filter between the\nstackcollapse preprocessor and FlameGraph:

\n
# I'm only interested in what's happening in MAIN():\nperf script | stackcollapse-perf.pl | grep MAIN | flamegraph.pl > MAIN.svg\n\n# I'm not interested in what's happening in init():\nperf script | stackcollapse-perf.pl | grep -v init | flamegraph.pl > noinit.svg\n\n# Let's pretend that realloc() is the same thing as malloc():\nperf script | stackcollapse-perf.pl | sed s/realloc/malloc/ | flamegraph.pl > alloc.svg\n
\n\n\n

If you have recursive calls you might want to merge them in order to\nhave a more readable view. This is implemented in my\nbranch\nby stackfilter-recursive.pl:

\n
# I want to merge recursive calls:\nperf script | stackcollapse-perf.pl | stackfilter-recursive.pl | grep MAIN | flamegraph.pl\n
\n\n\n

Update 2015-10-16: this has been merged upstream.

\n
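The transformation itself is simple: collapse runs of identical adjacent frames in each folded line. A minimal sketch of the idea (merge_recursive is a hypothetical name; the real logic lives in stackfilter-recursive.pl):

```shell
# merge_recursive: collapse runs of identical adjacent frames, so that
# "main;f;f;f;g 3" becomes "main;f;g 3". Minimal sketch of the idea
# behind stackfilter-recursive.pl, not the real implementation.
merge_recursive() {
    awk '{
        n = split($1, f, ";")
        s = f[1]
        for (i = 2; i <= n; i++) if (f[i] != f[i-1]) s = s ";" f[i]
        print s, $2
    }'
}
```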

Using FlameGraph with the poor man's profiler (based on GDB)

\n

Sometimes you might not be able to get relevant information with\nperf. This might be because you do not have debugging symbols for\nsome libraries you are using: you will end up with missing\ninformation in the stacktrace. In this case, you might want to use GDB\ninstead, with the poor man's profiler\nmethod, because it tends to be better at unwinding the stack without\nframe pointers and debugging information:

\n
# Sample an already running process:\npmp 500 0.1 $(pidof mycommand) > mycommand.gdb\n\n# Or:\nmycommand my_arguments &\npmp 500 0.1 $!\n\n# Generate the SVG:\ncat mycommand.gdb | stackcollapse-gdb.pl | flamegraph.pl > mycommand.svg\n
\n\n\n

Where pmp is a poor man's profiler script such as:

\n
#!/bin/bash\n# pmp - \"Poor man's profiler\" - Inspired by http://poormansprofiler.org/\n# See also: http://dom.as/tag/gdb/\n\nnsamples=$1\nsleeptime=$2\npid=$3\n\n# Sample stack traces:\nfor x in $(seq 1 $nsamples); do\n  gdb -ex \"set pagination 0\" -ex \"thread apply all bt\" -batch -p $pid 2> /dev/null\n  sleep $sleeptime\ndone\n
\n\n\n

Using this technique will slow the application down a lot.

\n

Compared to the example with perf, this approach samples both on-CPU\nand off-CPU time.

\n

A real world example of optimisation with FlameGraph

\n

Here are some figures obtained when I was optimising the\nSimGrid\nmodel checker\non a given application,\nusing the poor man's profiler to sample the stack.

\n

Here is the original profile before optimisation:

\n
\n\n \n\n
FlameGraph before optimisation
\n
\n\n

Avoid looking up data in a hash table

\n

Nearly 65% of the time is spent in get_type_description(). In fact, the\nmodel checker spends its time looking up type descriptions in some hash tables\nover and over again.

\n

Let's fix this and store a pointer to the type description instead of\na type identifier in order to avoid looking up those types over\nand over again:

\n
\n\n \"[profile\n\n
FlameGraph after avoiding the type lookups
\n
\n\n

Cache the memory areas addresses

\n

After this modification,\n32% of the time is spent in libunwind's get_proc_name() (looking up\nfunction names from given values of the instruction pointer) and\n12% is spent reading and parsing the output of cat\n/proc/self/maps over and over again. Let's fix the second issue first\nbecause it is simple: we cache the memory mapping of the process in\norder to avoid parsing /proc/self/maps all the time.

\n
\n\n \"[profile\n\n
FlameGraph after caching the /proc/self/maps output
\n
\n\n

Speed up function resolution

\n

Now, let's fix the other issue by resolving the functions\nourselves. It turns out we already had the address range of each function\nin memory (parsed from the DWARF information). All we have to do is use a\nbinary search in order to get a nice O(log n) lookup.

\n
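The lookup can be sketched over a text table of sorted, non-overlapping address ranges (find_function is a hypothetical helper; the real code searches in-memory DWARF data, not text):

```shell
# find_function: binary search a sorted "start end name" table (one
# entry per line on stdin) for the function containing address $1.
# Hypothetical sketch of the O(log n) lookup; ranges must be sorted
# and non-overlapping.
find_function() {
    awk -v ip="$1" '
        BEGIN { ip += 0 }                 # force numeric comparisons
        { start[NR] = $1; end[NR] = $2; name[NR] = $3 }
        END {
            lo = 1; hi = NR
            while (lo <= hi) {
                mid = int((lo + hi) / 2)
                if (ip < start[mid])    hi = mid - 1
                else if (ip > end[mid]) lo = mid + 1
                else { print name[mid]; exit }
            }
        }'
}
```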
\n\n \"[profile\n\n
FlameGraph after optimising the function lookups
\n
\n\n

Avoid looking up data in a hash table (again)

\n

Still 10% of the time is spent looking up type descriptions from type\nidentifiers in a hash tables. Let's store the reference to the type\ndescriptions and avoid this:

\n
\n\n \"profile\n\n
FlameGraph after avoiding some remaining type lookups
\n
\n\n

Result

\n

The non-optimised version was taking 2 minutes to complete. With\nthose optimisations, it takes only 6 seconds \"\ud83d\ude2e\". There is\nstill room for optimisation here as 30% of the time is now spent in\nmalloc()/free() managing heap information.

\n

Remaining stuff

\n

Sampling other events

\n

Perf can sample many other kinds of events (hardware performance\ncounters, software performance counters, tracepoints\u2026). You can get\nthe list of available events with perf list. If you run it as\nroot you will see a lot more events (all the kernel tracepoints).

\n

Here are some interesting events:

\n\n

More information about some perf events can be found in\nperf_event_open(2).

\n

You can then sample an event with:

\n
perf record --call-graph dwarf -e cache-misses myapp\n
\n\n\n
\n\n \"[FlameGraphe\n\n
FlameGraph of cache misses
\n
\n\n

Ideas

\n\n

Extra tips

\n\n

References

\n\n
\n
\n
    \n
  1. \n

    When using frame pointer unwinding, the kernel unwinds the stack\nitself and only gives the instruction pointer of each frame to\nperf record. This behaviour is triggered by the\nPERF_SAMPLE_CALLCHAIN sample type.

    \n

When using DWARF unwinding, the kernel takes a snapshot of (a\npart of) the stack and gives it to perf record: perf record\nstores it in a file and the DWARF unwinding is done afterwards by\nthe perf tools. This uses\nPERF_SAMPLE_STACK_USER. PERF_SAMPLE_CALLCHAIN is used as well\nbut for the kernel-side stack (exclude_callchain_user).\u00a0\u21a9

    \n
  2. \n
\n
"}]}