Adam Herzog

Happy SysAdmin Day: Our Favorite SysAdmin War Stories

Thank you to everyone who submitted, and be sure to thank your SysAdmins today!

We received so many awesome submissions that it was very difficult to pick only a few war stories to share.

Members of the Docker team voted and picked our top SysAdmin war story – congratulations to Sam Burton for submitting our favorite story!

But we also wanted to share some of the other great entries – scroll down to read more SysAdmin war stories.

Happy SysAdmin Day!

Sam Burton, SysAdmin at DGPVP LLC

So recently I retired as a SysAdmin for a medium-sized network of game servers, and all was going pretty well. I hadn’t yet had my SSH keys removed, and I couldn’t be bothered to delete them on my end, so I just left them.

Anyway, it’s a Saturday night and I’ve got the drinking gang and a few others round for a par-tay. After a few drinks, four of us decide to go on an adventure outside. After walking for a few miles we find a nice spot and decide to sit down. That’s when I noticed I had a few notifications on my phone, but I decided to ignore them. Eventually I get a call on my mobile, and it’s a member of staff from the team I had left a few days earlier. This is when all hell broke loose. I’m told that the styling on our forums had changed and that I’d been posting things on the forums that I shouldn’t have. I then realised that my slightly drunk self had forgotten to lock my own computer, and the people I had left behind at my house had access. I went into full panic mode: I’m like 2 miles from my house, and these people have access to a number of remote servers.

Anyway, I managed to get back to my house, and it was even worse than I first thought. They’d completely messed up our Discourse installation, and it was in such a state that the built-in Discourse backup system wouldn’t even save us. However, thanks to the beauty of Docker and the brilliant work from those at Discourse, I could completely rebuild the container and restore the data in a matter of minutes with next to no configuration, which was a relief to my drunken self.

The moral of the story: create plentiful remote backups, and ALWAYS take responsibility and remove SSH keys that you no longer need. Otherwise all hell will break loose and you will want to go cry in a corner.
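
A rough sketch of that kind of recovery, assuming Discourse’s standard Docker-based install under /var/discourse and a hypothetical backup filename:

cd /var/discourse
./launcher rebuild app     # rebuild the Discourse container from scratch
./launcher enter app       # get a shell inside the running container
discourse enable_restore   # allow restores from the command line
discourse restore my-backup.tar.gz   # hypothetical backup filename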


 

Here are some of the Docker team’s other favorite submissions:

 

Xavier Baude, System Engineer, Adeo

Oh, you know what? I had an application that did not support running multiple instances, so I managed 150 servers… Thanks to Docker, I now have only 4 servers for the same performance and results.


Randy Johnson, Bioinformatics Analyst, Leidos Biomedical, Inc.

One of our database admin folks slipped a small script onto our servers one year, shortly before April 1. On the morning of April 1, we received a lot of visits and help requests from people reporting that something was terribly wrong with the database. Upon logging in to see what was going on, I was presented with a choice: “Delete my hard drive” or “Blow up my computer”. I chose the button to blow up my computer, and the login script executed normally from that point on.



 

Felix Barbeira, Linux Sysadmin, Dinahosting

This happened a few years ago, when I worked with a couple of friends as a consultant doing sysadmin tasks for the naive clients that dared to employ us. Fortunately I’m now employed at a serious company and I’m proud of my work.

Like I said, an incautious client contacted us and told us he was a little bit scared about their disk cabinet (direct-attached storage). The firmware was 3 or 4 versions out of date, and they *needed* the latest stable version in order to use a brand-new feature the hardware manufacturer had just released. The star project the company had been developing for the last few months *required* this feature, and guess what: it was the classic “it works on my machine” story with the programmers. Once they deployed the code against the cabinets, the application started to break everywhere. The programmers didn’t have a clue about “that big machine where we upload the code”. So the scared boss decided to hire the company where I used to work to fix it.

We had to upgrade through three firmware versions in the same night, one after another, fingers crossed that everything would work. If something failed we would have to restore TERABYTES of data, because that cabinet was used to store backups, financial information, employee payroll (gulp!), etc.

It was Saturday night at 02:00 when we started the first firmware upgrade. The tool was Java, so all we could do was cross our fingers and pray to the lord. The first upgrade went fine, so we tried the second and finally the third. We were almost clapping our hands when the scroll bar showing the firmware progress STARTED TO GO BACKWARD. We lost control and started to panic. We were just thinking about the restore procedure when someone yelled, “DON’T PANIC!! Just stare firmly at the progress bar and wait a few minutes more…”. It was 10 minutes of fear and darkness, but suddenly the progress bar started to move FORWARD!! (everyone knows Java must be made of some kind of strange alien technology). The firmware upgrade completed successfully, and half an hour later the data was accessible again.

We finished the job, and one week later the company that hired us was able to launch their star product. It was a good ending, but sometimes the “move-forward Java bar” still chases me in my nightmares…

Maybe in the near future my current company will have to deal with some direct-attached storage, for example launching Docker containers that read the data on these cabinets and write to units also stored on “that big machine”. For sure a Docker t-shirt would give me strength, and I would be the envy of my coworkers. Best regards.


Jani Tiainen, Software Designer, Keypro Oy

This war story dates back a long way, to when I was a sysadmin responsible for our software development tools: repositories, the issue manager and such.

I was working on converting all our source code repositories from CVS to Subversion. It was a rather dull job: basically I just ran an import tool on each CVS repository and then deleted the CVS repository.

But then things went really sideways. I was a bit tired and accidentally typed: rm -rf . /cvs-repo. As you can see, I somehow managed to type a space between the dot and the slash. And you all know what that means – big panic.

I wasn’t really sure whether I had any recent backups at hand, and I didn’t recall making any extra ones. You can only imagine the feeling when you realize that, in the worst case, you have just flushed almost 6 years of development history down the drain!

If I had been using Docker (it didn’t even exist back then) like I do today, I would have been saved so much trouble. I would just have copied all the repos inside a container and run the whole conversion process there – no harm done if something went wrong.
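
A minimal sketch of what that could look like today, assuming the cvs2svn conversion tool and hypothetical host paths /srv/cvs and /srv/svn:

# Mount the CVS repositories read-only so that no typo inside the
# container can destroy them; only the new Subversion repository
# is written back to the host.
docker run --rm \
    -v /srv/cvs:/cvs:ro \
    -v /srv/svn:/svn \
    ubuntu:14.04 \
    sh -c 'apt-get update && apt-get install -y cvs2svn && \
           cvs2svn --svnrepos /svn/myproject /cvs/myproject'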

Fortunately, after I recovered from the initial shock of staring at an empty directory, I remembered that I had actually made a backup before I began the whole operation.

After a few cups of coffee and a bit of fresh air I finished the work successfully. I really wish I had had Docker at hand back then – it would have saved many skipped heartbeats and a lot of sweating.



 

Mike Colson, Sr. vSpecialist, EMC

In the relatively early days of virtualization, I came to a new organization and team. They were running everything on physical hardware, and I had just come from an environment where we had virtualized almost 90%. I started taking stock and realized that most of our servers were way underutilized – a great case for virtualizing. So I talked to the NOC folks and said, “Hey, have you thought about virtualizing the domain controllers? The hardware they’re on is ancient and using less than a quarter of its capacity.”

“Sir, I believe you are new here, but do you know what a domain controller does?”

Now, at this point in my career I had about 8 years of network and Citrix administration under my belt, but I said, “I think so, but why don’t you explain it to me just in case.”

“The domain controller… Well, it controls the DOMAIN.”

It was at that point I realized that I had some educating to do. We virtualized the domain controllers shortly thereafter, and eventually got the environment to 80% virtualized.


Mani Chandrasekaran, Advisory Consultant, EMC

I have always been a developer, but I have had to moonlight as a system administrator on many projects when the client had no staff, or staff unwilling to do the work. I was the technical lead for a team, and I was asked to set up a cluster across a bunch of servers. When I entered the data center, I had no idea how to connect the servers with both public network connections and internal cluster interconnects!! This had to be done for 20+ servers!! A bunch of volunteers and I opened the manuals, sat on the cold data center floor and started wiring the network cables into the right slots. Apart from a couple of wrongly cabled servers, the rest were done correctly. My appreciation of sysadmins rose from that day onwards, and I have since spent many hours inside cold data centers amidst the whirring noises.



 

Ivan Lopez, Engineer, Kaleidos

We’ve been using GitLab as our private git repository for a while. I’m the sysadmin who installed it, and I’m in charge of upgrading to new versions. The first time I tried to upgrade was several months after the installation, so I had skipped 7 or 8 releases.

I did it during work hours, and without trying it first in a pre-production environment. What happened was that it failed, and I had to restore a backup very quickly because my co-workers couldn’t connect to the repository.

If I had used Docker for our private GitLab, that wouldn’t have happened: I would have created a new container from our image and applied the changes without touching our online instance.
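
A minimal sketch of that workflow, assuming the official gitlab/gitlab-ce image and hypothetical host paths holding a copy of the production config and data:

# Rehearse the upgrade against a throwaway container; production stays untouched.
docker run -d --name gitlab-upgrade-test \
    -v /srv/gitlab-test/config:/etc/gitlab \
    -v /srv/gitlab-test/data:/var/opt/gitlab \
    gitlab/gitlab-ce:latest
# ...check the result, then either upgrade for real or throw it away:
docker rm -f gitlab-upgrade-test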

Lesson learned the hard way.

On the other hand, I’m currently on a project with a lot of components, and last week I created containers for all of them: postgres, rabbit, nginx, a Grails application, 5 Spring Boot applications and the frontend. Now, creating a new demo environment or starting up all the elements on our laptops is as simple as docker-compose up 🙂
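
A minimal sketch of such a setup, with hypothetical image and service names standing in for the project’s components:

cat > docker-compose.yml <<'EOF'
# hypothetical service definitions, one per component
db:
  image: postgres:9.4
rabbit:
  image: rabbitmq:3
backend:
  image: mycompany/grails-app   # hypothetical application image
  links:
    - db
    - rabbit
web:
  image: nginx:1.9
  ports:
    - "80:80"
  links:
    - backend
EOF
docker-compose up -d   # the whole environment comes up with one command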


Daniel Kraaijl, Devops, Bax-Shop.nl B.V.

So, a couple of years ago (about 6) is when I really wish I had had Docker at my disposal. Running a platform for 2 major Dutch health websites wasn’t your average cup of tea.

Installing similar machines always resulted in behavioral differences. If we had had Docker back then, deploying new machines would have been a breeze.

We were running most boxes on Red Hat 9 or CentOS 3, and scaling up with more boxes was always a pain.

Luckily now we have Docker 😀



 

Nathan Lacey, System Administrator, Source Intelligence

Back in the day, there were a bunch of IMAP servers. Since this was long ago, they were running Linux with the 2.4 kernel. They started out storing their mail on locally attached 72 GB SCSI disks, organized simply with one ext2 filesystem per disk, but then they moved the storage to a faster and more sophisticated SAN backend with RAID-10 arrays (still on small fast enterprise disks), giving each server node a single logical array (on a dedicated set of drives) and data filesystem (still ext2).

Not too long after the move to the SAN, the servers started falling over every so often, unpredictably; their load average would climb to the many hundreds (we saw load averages over 700), IMAP response times went into the toilet, and eventually the machine would have to be force-booted. However, nothing obvious was wrong with the system stats (at least nothing that seemed to correlate with the problems). Somewhat through luck, we discovered that the time it took to touch and then remove a file in the data filesystem was closely correlated to the problem; when the time started going up, the system was about to get hammered. In the end, this led us to the answer.
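
A minimal sketch of that kind of canary check, assuming GNU time and a hypothetical mount point /data for the mail filesystem:

# Time a create-plus-delete in the data filesystem every minute; rising
# latency here predicted the pileups long before iostat showed anything.
while true; do
    /usr/bin/time -f '%e seconds' \
        sh -c 'touch /data/.canary && rm /data/.canary'
    sleep 60
done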

Ext2 keeps track of allocated inodes (and allocated blocks) in bitmap blocks in the filesystem. In Linux 2.4, all changes to these bitmaps for a single filesystem were serialized by a single filesystem-wide kernel mutex, so only one process could be allocating or freeing an inode at a time. In the normal course of events, this is not a problem; most filesystems do not have a lot of inode churn, and if they do the bitmap blocks will all stay cached in system RAM and so getting the mutex, updating the bitmap, and releasing the mutex will normally be fast.

What had happened with us is that this broke down. First, we had a lot of inode churn because IMAP was creating (and then deleting) a lot of lockfiles. This was survivable when the system had a lot of separate filesystems, because each of them had a separate lock and not that many bitmap blocks. But when we moved to the SAN we moved to a single big filesystem; this meant both a single lock for all file creation and deletion, and that the filesystem had a lot of bitmap blocks.

(I believe that pretty much the same amount of disk space was in use in both cases; it was just organized differently.)

This could work only as long as either almost all of the bitmap blocks stayed in cache or we didn’t have too many processes trying to create and delete files. When we hit a crucial point in general IO load and memory usage on an active system, the bitmap blocks started falling out of cache, more and more inode operations had to read bitmap blocks back in from disk while holding the mutex (which meant they took significant amounts of time), and more and more processes piled up trying to get the mutex (which was the cause of the massive load average). Since this lowered how frequently any particular bitmap block was being used, it made them better and better candidates for eviction from cache and made the situation even worse.

(Of course, none of this showed up on things like iostat because general file IO to do things like read mailboxes was continuing normally. Even the IO to read bitmap blocks didn’t take all that long on a per-block basis; it was just that it was synchronous and a whole lot of processes were effectively waiting on it.)

Fortunately, once we understood the problem we could do a great deal to mitigate it, because the lockfiles that the IMAP server was spending all of that time and effort to create were just backups to its fcntl() based locking. So we just turned them off, and things got significantly better.

(The overall serialized locking problem was fixed in the 2.6 kernel as part of work to make ext2 and ext3 more scalable on multiprocessor systems, so you don’t have to worry about it today.)


Kevin McKeever, System Administrator, Aurion

Just before updating the first production database system with a well-thought-out SQL query for a large health insurance company, I fired up another screen on the test system. I got distracted by a phone call, and returned to enter the DELETE on the transactions table (on the test database, or so I thought). My boss came running in 4 minutes later saying all the transactions on the live system were GONE. I turned and realised that the screen on the left was the live database and the screen on the right was the test! Arhhhh. A long night was then spent doing a full restore, and a few scars were had.


Francisco Javier Tsao Santín, GPUL

Some time ago I was working as a Linux sysadmin at a major company. Our team was in charge of the operating system, but other teams were the application administrators, so in some circumstances we allowed them certain privileged commands via sudo. They could do some service installs and patching in this manner.

One day I received a phone call from one of our users. He told me there was a server behaving erratically. I tried to ssh into it: connection refused. I tried to log in from the console, and all I could see were weird messages.

So I booted the server in rescue mode from an OS ISO and mounted the filesystems. I began to see that someone had changed all the permissions across the entire system. I investigated for a while and discovered who the culprit was, and the command they had executed: a sudo chmod -R something /

How can you recover a server from a situation like this? After some preliminary steps (fixing a few permissions by hand, chrooting), we did it using the RPM database:

# Reset file permissions for every installed package to the values
# recorded in the RPM database
for p in $(rpm -qa); do rpm --setperms $p; done
# Do the same for file owners and groups
for p in $(rpm -qa); do rpm --setugids $p; done

We had a SUSE server in our case, so I did an additional step:

/sbin/conf.d/SuSEconfig.permissions

And… of course, I never would have had this problem if the application had been jailed in a Docker container (and the user that ran the chmod in the state prison ;-))
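
A minimal sketch of that kind of jail, with hypothetical image and volume names:

# The application team gets root only inside the container, so a runaway
# chmod -R / rewrites the container's filesystem, not the host's.
docker run -d --name app-jail \
    -v /srv/app-data:/data \
    legacy-app-image   # hypothetical image
# Worst case, recovery is a rebuild instead of a rescue boot:
docker rm -f app-jail
docker run -d --name app-jail -v /srv/app-data:/data legacy-app-image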

 


 
