Collaborating in the Helix Lab Cloud to Fix a Software Problem - How to do Remote Help Right

As you might know, we recently announced and launched the preview version of the Helix Lab Cloud. With Helix Lab Cloud, you access (simulated) computers in the cloud, and among other unique abilities, you can share a session with other people. Thanks to the underlying Simics technology, sharing a session covers both the current state of the lab machine and the history of inputs and outputs. Sharing is a great tool for collaborative debugging and error reporting, and in this blog post I will tell the story of what I believe was the first application of this to a real problem, in which this author helped a colleague solve an annoying compilation failure happening on a lab machine.

The story starts with an Intel-based Linux machine inside the Lab Cloud. My colleague, let’s call him Nemo, booted up the lab machine and proceeded to upload a fairly large application to its disk. Once the application had been uploaded, he ran configure to set up the build, and then make to build it. Since this was a known-good Linux application that built and worked correctly on many other machines, he quite rightly expected it to build without a hitch in the Lab Cloud as well.

However, this was not the case. Instead there were strange errors.  Configure would complete successfully. But then make would complain about time stamps and run configure again. Repeat. Not a good place to be when the goal was to make progress.

So what was going on? This question was first asked in the normal way, the way you have always asked for help when encountering an issue in a system you are testing. We in the Helix Lab Cloud team received an email from Nemo asking why things did not work, with a copy of the output seen on the target as configure and make looped (a static capture from the target terminal). A few more emails went back and forth, including a few more examples of output. No progress.

At this point, someone remembered that we had a system that was supposed to let us share things directly, rather than going about sending text copied into emails.  So, Nemo shared his Lab Cloud session that exhibited the problem with this author.

[Image: hlc-bug-1]

Here is how sharing looks in the Helix Lab Cloud; taken from my account quite recently. Note how a couple of sessions are marked with “Guest”, indicating that I am a guest in those sessions. The other sessions are ones that I have created and control. The session is not running at this time.

This provided me with the state of the machine, the contents of the serial console, and everything on the disk of the machine, exactly as it was when Nemo left it. Being in a different time zone, I opened up the session on my next work day while Nemo was sleeping, and did some analysis.

The first operation was to replay the previous build attempts, just to see that things did indeed fail. They did. It is always good practice to check the basics first, and with the automatic record and replay in the Lab Cloud, it was dead easy to review the prior commands sent to the target system. I just clicked on “replay” on the snapshot in the timeline that Nemo told me showed the error, and the Lab Cloud replayed the session from that point. During the replay, I could not send any input to the target system. Instead, I could see everything that Nemo had typed into the target console, and the resulting output from the target system.

The Lab Cloud records every single character or network packet that goes into the lab system. You can review what was done even if the person who did it is not quite sure exactly what he or she did – the Lab Cloud gives you perfect memory! This always-on recording model reduces the mental load on the user, since there is no need to maintain a separate log of actions. If a user wants to help another user review their work, they should add snapshots at interesting points in the work, such as just before trying an important operation.
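As a rough mental model of why an always-on input log plus snapshots is enough to review someone else's work, here is a toy Python sketch. It is in no way how Simics or the Lab Cloud is implemented; `ToyMachine`, `ToySession`, and the snapshot name are hypothetical names made up for illustration, and the "machine" simply echoes its input.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMachine:
    """Deterministic stand-in for a lab machine: output depends only on input."""
    console: list = field(default_factory=list)

    def feed(self, line: str) -> None:
        # A real target would execute the command; the toy just echoes it.
        self.console.append(f"$ {line}")


@dataclass
class ToySession:
    """Hypothetical session: every input is logged, snapshots mark positions."""
    inputs: list = field(default_factory=list)
    snapshots: dict = field(default_factory=dict)

    def send(self, line: str) -> None:
        self.inputs.append(line)  # always recorded, no separate log needed

    def snapshot(self, name: str) -> None:
        self.snapshots[name] = len(self.inputs)

    def replay(self, from_snapshot: str) -> list:
        """Return the console output produced after the named snapshot."""
        start = self.snapshots[from_snapshot]
        machine = ToyMachine()
        for line in self.inputs[:start]:
            machine.feed(line)  # stands in for restoring the saved state
        seen_before = len(machine.console)
        for line in self.inputs[start:]:
            machine.feed(line)  # re-apply the recorded inputs in order
        return machine.console[seen_before:]


session = ToySession()
session.send("./configure")
session.snapshot("before-make")
session.send("make")
replayed = session.replay("before-make")
print(replayed)
```

Because the toy machine is deterministic, replaying the recorded inputs from a snapshot always reproduces the same console output, which is the property that makes a recorded session reviewable by someone who was not there when it was created.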

But I digress… what’s important is that the bug was definitely confirmed.  So, what was going on?

On a hunch, I decided to check the current target time, since make had complained about time stamps. This turned out to be a correct guess. The target system thought the date was some time in 2008. With files dated in 2015 being uploaded and then used in the build, any comparison between the current time and the time stamps of the files was bound to go wrong, especially since newly generated files got time stamps older than that of the script that generated them. The wrong time was a configuration issue in the machine setup: the person who generated and configured this particular platform setup forgot to set the virtual time of the simulated machine to the current time. A small mistake, easy to make, but with fairly non-obvious side effects.
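To make the mechanism concrete, here is a minimal Python sketch of a make-style freshness check. It is only an illustration: the file names, the helper names (`needs_rebuild`, `set_mtime`, `simulate`), and the exact dates are assumptions, since all we know from the actual session is that the clock said 2008 and the uploaded files were from 2015.

```python
import os
import tempfile
from datetime import datetime, timezone


def needs_rebuild(target: str, prerequisite: str) -> bool:
    """Make-style check: rebuild if the target is missing or older than its input."""
    if not os.path.exists(target):
        return True
    return os.path.getmtime(target) < os.path.getmtime(prerequisite)


def set_mtime(path: str, when: datetime) -> None:
    """Set both access and modification time of a file to the given moment."""
    ts = when.timestamp()
    os.utime(path, (ts, ts))


def simulate(clock: datetime, uploaded: datetime) -> bool:
    """Generate an output file 'now' (per the machine clock) from an uploaded script."""
    with tempfile.TemporaryDirectory() as workdir:
        script = os.path.join(workdir, "configure")
        output = os.path.join(workdir, "Makefile")
        open(script, "w").close()
        set_mtime(script, uploaded)  # uploaded file keeps its original date
        open(output, "w").close()
        set_mtime(output, clock)     # generated file is stamped with the machine clock
        return needs_rebuild(output, script)


# Assumed dates for illustration only.
UPLOADED = datetime(2015, 6, 1, tzinfo=timezone.utc)
wrong_clock = simulate(datetime(2008, 1, 1, tzinfo=timezone.utc), UPLOADED)
fixed_clock = simulate(datetime(2015, 6, 2, tzinfo=timezone.utc), UPLOADED)
print(wrong_clock, fixed_clock)
```

With the clock in 2008, the freshly generated file is always older than the script that produced it, so the check keeps asking for a rebuild. Once the clock is later than the upload date, the generated file is newer and the loop stops.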

Access to the complete system was critical to performing this analysis. Without the ability to share the current state of the target system, we would have had to continue the dialog over email, trying one thing after the other, with rarely more than one or two attempts per day due to the different time zones. You all know how frustrating this can be, and how often support just says “get me a remote login so I can check things for myself”. With the session and its state and history shared inside the Helix Lab Cloud system, I could instead investigate anything I needed without a roundtrip over email. If I had a hunch, I could check on it immediately, rather than asking Nemo to do it for me.

Another benefit of sharing the session was that I could test and validate the proposed solution in the system context of the original problem. The fix (setting the date) was applied right in the system that exhibited the problem. This achieved two things. First, it made it very easy to show Nemo that I had indeed solved the problem. Second, it actually solved the problem for Nemo. He didn’t need to do anything himself to apply the fix, since the system that I had tinkered with was the one he was using. Thus, Nemo could continue right where I left off. This is an incredibly powerful tool for supporting users and solving problems.

[Image: hlc-bug-2]

Sharing in HLC maintains the ownership of the session, but the user that a session is shared with can add new snapshots at the end of the session. Since the owner can delete those snapshots and undo all actions, this is perfectly safe and controllable.

It is worth noting that the record, replay, and snapshot technology of the Lab Cloud also meant that everything I did was recorded for review – and possible to undo. In case I had managed to completely mess up the system while trying to fix it, Nemo (or I) could simply have rolled back the state of the system to before he shared it with me, and at least he would have been no worse off.  Nemo is still the owner of the session, as shown in the picture above, and has full authority to undo and roll back to a previous state.

To try the Helix Lab Cloud today, please go to the lab cloud page at https://lab.cloud.windriver.com, click “Register”, and use the invitation code BlogInvite. This offer will expire in October of 2015.

For more on the Helix Lab Cloud, see some previous blog posts: