Testbox Imaging (Backup / Restore)
==================================


Introduction
------------

This document explores deploying a very simple drive imaging solution to help
avoid needing to manually reinstall testboxes when a disk goes bust or the OS
install seems to be corrupted.


Definitions / Glossary
======================

See AutomaticTestingRevamp.txt.


Objectives
==========

 - Off site, no admin interaction (no need for ILOM or similar).
 - OS independent.
 - Space and bandwidth efficient.
 - As automatic as possible.
 - Logging.


Overview of the Solution
========================

Here is a brief summary:

 - Always boot testboxes via PXE using PXELINUX.
 - Default configuration is local boot (hard disk / SSD)
 - Restore/backup action triggered by machine specific PXE config.
 - Boots special debian maintenance install off NFS.
 - A maintenance service (systemd style) does the work.
 - The service reads action from TFTP location and performs it.
 - When done the service removes the TFTP machine specific config
   and reboots the system.

Maintenance actions are:

 - backup
 - backup-again
 - restore
 - refresh-info
 - rescue
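
As a rough illustration of how a service might validate the requested action
before acting on it, here is a minimal sketch.  The function name
``is_valid_action`` is hypothetical and is not taken from
``testbox-maintenance.sh``; only the action names come from the list above.

```shell
#!/bin/sh
# Hypothetical helper: succeed only for the maintenance actions listed above.
is_valid_action() {
    case "$1" in
        backup|backup-again|restore|refresh-info|rescue)
            return 0
            ;;
        *)
            # Anything else (including an empty string) is rejected.
            return 1
            ;;
    esac
}
```

Rejecting unknown actions up front keeps a typo in the PXE configuration from
being mistaken for a destructive action such as ``restore``.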

A possible extension is a modifier that restricts an action to a subset of the
disks on testboxes that also have other OSes installed.  Support for partition
level backup/restore is not explored here.


How to use
----------

To perform one of the above maintenance actions on a testbox, run the
``testbox-pxe-conf.sh`` script::

    /mnt/testbox-tftp/pxelinux.cfg/testbox-pxe-conf.sh 10.165.98.220 rescue

Then trigger a reboot.  The box will then boot the NFS rooted debian image and
execute the maintenance action.  On success, it will remove the testbox hex-IP
config file and reboot again.
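
PXELINUX looks up a machine specific configuration file named after the
client's IPv4 address written as eight uppercase hexadecimal digits.  A minimal
sketch of that conversion follows; ``ip_to_hex`` is a hypothetical helper name
used for illustration, not something taken from ``testbox-pxe-conf.sh``.

```shell
#!/bin/sh
# Hypothetical helper: print the PXELINUX style hex name for a dotted IPv4 address.
ip_to_hex() {
    # Split the address on dots by changing IFS inside a subshell, so the
    # caller's IFS and positional parameters are left alone.
    (
        IFS=.
        set -- $1
        [ $# -eq 4 ] || exit 1
        printf '%02X%02X%02X%02X\n' "$1" "$2" "$3" "$4"
    )
}

ip_to_hex 10.165.98.220   # prints 0AA562DC
```

The printed name is the file to look for under ``pxelinux.cfg/`` while a
maintenance action is pending for that testbox.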


Storage Server
==============

The storage server will have three areas used here.  Using NFS for all three
avoids extra work getting CIFS sharing right too (NFS is already a pain).

 1. /export/testbox-tftp    - TFTP config area.  Read-write.
 2. /export/testbox-backup  - Images and logs.  Read-write.
 3. /export/testbox-nfsroot - Custom debian.  Read-only, no root squash.


TFTP (/export/testbox-tftp)
============================

The testbox-tftp share needs to be writable, root squashing is okay.

We need files from both PXELINUX and SYSLINUX to make this work now.  On a
debian system, the ``pxelinux`` and ``syslinux`` packages need to be
installed.  We actually do this further down when setting up the nfsroot, so
it's possible to get them from there by postponing this step a little.  On
debian 8.6.0 the PXELINUX files are found in ``/usr/lib/PXELINUX`` and the
SYSLINUX ones in ``/usr/lib/syslinux``.

The initial PXE image as well as associated modules come in three variants,
BIOS, 32-bit EFI and 64-bit EFI.  We'll only need the BIOS one for now.
Perform the following copy operations::

  cp /usr/lib/PXELINUX/pxelinux.0 /mnt/testbox-tftp/
  cp /usr/lib/syslinux/modules/*/ldlinux.* /mnt/testbox-tftp/
  cp -R /usr/lib/syslinux/modules/bios  /mnt/testbox-tftp/
  cp -R /usr/lib/syslinux/modules/efi32 /mnt/testbox-tftp/
  cp -R /usr/lib/syslinux/modules/efi64 /mnt/testbox-tftp/
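
To double-check the copy before pointing a testbox at the share, something
along these lines may help.  It is only a sketch: the helper name
``check_tftp_root`` is made up, and the assumption that a BIOS client needs
``pxelinux.0``, ``ldlinux.c32`` and the ``bios/`` module directory is ours.

```shell
#!/bin/sh
# Hypothetical helper: report anything missing from the TFTP root for BIOS PXE boot.
check_tftp_root() {
    root=$1
    missing=0
    # Files assumed to be required in the root of the share.
    for f in pxelinux.0 ldlinux.c32; do
        if [ ! -f "$root/$f" ]; then
            echo "missing file: $root/$f" >&2
            missing=1
        fi
    done
    # Module directory referenced by "PATH bios" in the PXE configuration.
    if [ ! -d "$root/bios" ]; then
        echo "missing directory: $root/bios" >&2
        missing=1
    fi
    return "$missing"
}
```

It would be invoked as ``check_tftp_root /mnt/testbox-tftp`` and returns
non-zero if anything is missing.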


For simplicity, all the testboxes boot using good old fashioned BIOS, no EFI.
However, it doesn't really hurt to be prepared.

The PXELINUX related files go in the root of the testbox-tftp share.  (As
mentioned further down, these can be installed on a debian system by running
``apt-get install pxelinux syslinux``.)  We need the ``*pxelinux.0`` files
typically found in ``/usr/lib/PXELINUX/`` on debian systems (recent ones
anyway).  It is possible we may need one or more of the modules [1]_ that
ship with PXELINUX/SYSLINUX, so do copy ``/usr/lib/syslinux/modules`` to
``testbox-tftp/modules`` as well.


The directory layout related to the configuration files is dictated by the
PXELINUX configuration file searching algorithm [2]_.  Create a subdirectory
``pxelinux.cfg/`` under ``testbox-tftp`` and create the world readable file
``default`` with the following content::

  PATH bios
  DEFAULT local-boot
  LABEL local-boot
  LOCALBOOT

This makes booting from the local disk the default behavior.

Copy the ``testbox-pxe-conf.sh`` script file found in the same directory as
this document to ``/mnt/testbox-tftp/pxelinux.cfg/``.  Edit the copy to correct
the IP addresses near the top, as well as any linux, TFTP and PXE details near
the bottom of the file.   This script will generate the PXE configuration file
when performing maintenance on a testbox.


Images and logs (/export/testbox-backup)
=========================================

The testbox-backup share needs to be writable, root squashing is okay.

In the root there must be a file ``testbox-backup`` so we can easily tell
whether we've actually mounted the share or are just staring at an empty mount
point directory.
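
The marker file check could look roughly like this.  This is a sketch under
the assumption that the share is mounted at ``/mnt/testbox-backup``;
``backup_share_mounted`` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the marker file is visible, i.e. the
# share is really mounted and we are not looking at the bare mount point.
backup_share_mounted() {
    [ -f "${1:-/mnt/testbox-backup}/testbox-backup" ]
}

# Possible use before writing any images:
#   backup_share_mounted /mnt/testbox-backup || echo "backup share not mounted" >&2
```

Checking for a known file is more robust than checking that the directory
exists, since the empty mount point directory is always there.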

The ``testbox-maintenance.sh`` script maintains a global log in the root
directory that's called ``maintenance.log``.  Errors will be logged there as
well as a ping and the action.

We use a directory layout based on dotted decimal IP addresses here, so for a
server with the IP 10.40.41.42 all its files will be under ``10.40.41.42/``:

``<hostname>``
  The name of the testbox (empty file).  Helps with finding a testbox by name.

``testbox-info.txt``
  Information about the testbox.  Starting off with the name, decimal IP,
  PXELINUX style hexadecimal IP, and more.

``maintenance.log``
  Maintenance log file recording what the maintenance service does.

``disk-devices.lst``
  Optional list of disk devices to consider backing up or restoring.  This is
  intended for testboxes with additional disks that are used for other purposes
  and should not be touched.

``sda.raw.gz``
  The gzipped raw copy of the sda device of the testbox.

``sd[bcdefgh].raw.gz``
  The gzipped raw copies of sdb, sdc, sdd, sde, sdf, sdg, sdh, etc. if any of
  them exist and are disks/SSDs.
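
The image naming scheme can be illustrated with a small sketch.  It assumes
plain ``gzip`` streaming and uses the hypothetical helpers ``backup_disk`` and
``restore_disk``; the real ``testbox-maintenance.sh`` may well do this
differently.

```shell
#!/bin/sh
# Hypothetical helper: write <image dir>/<device name>.raw.gz from a device.
backup_disk() {
    dev=$1        # e.g. /dev/sda
    imgdir=$2     # e.g. /mnt/testbox-backup/10.40.41.42
    gzip -c < "$dev" > "$imgdir/$(basename "$dev").raw.gz"
}

# Hypothetical helper: write the stored image back onto the device.
restore_disk() {
    dev=$1
    imgdir=$2
    gunzip -c < "$imgdir/$(basename "$dev").raw.gz" > "$dev"
}
```

With these, ``backup_disk /dev/sda /mnt/testbox-backup/10.40.41.42`` would
produce ``sda.raw.gz`` in that testbox's directory, matching the layout above.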


Note! If it turns out we can be certain to get a valid host name, we might just
      switch to use the hostname as the directory name instead of the IP.


Debian NFS root (/export/testbox-nfsroot)
==========================================

The testbox-nfsroot share should be read-only and must **not** have root
squashing enabled.  Also, make sure setting the set-uid-bit is allowed by the
server, or ``su`` and ``sudo`` won't work.

There are several ways of creating a debian nfsroot, but since we've got a
tool like VirtualBox around we've just installed it in a VM, prepared it,
and copied it onto the NFS server share.

As of writing, debian 8.6.0 is current, so a minimal 64-bit install of it was
done in a VM.  After installation the following modifications were made:

 - ``apt-get install pxelinux syslinux initramfs-tools zip gddrescue sudo joe``
   and optionally ``apt-get install smbclient cifs-utils``.

 - ``/etc/default/grub`` was modified to set ``GRUB_CMDLINE_LINUX_DEFAULT`` to
   ``""`` instead of ``"quiet"``.  This allows us to see messages during boot
   and perhaps spot why something doesn't work on a testbox.  Regenerate the
   grub configuration file by running ``update-grub`` afterwards.

 - ``/etc/sudoers`` was modified to allow the ``vbox`` user to use sudo without
   requiring any password.

 - Create the directory ``/etc/systemd/system/getty@tty1.service.d`` and create
   the file ``noclear.conf`` in it with the following content::

     [Service]
     TTYVTDisallocate=no

   This stops getty from clearing VT1 and lets us see the tail of the boot up
   messages, which includes messages from the testbox-maintenance service.

 - Mount the testbox-nfsroot under ``/mnt/`` with write privileges.  (The write
   privileges are temporary - don't forget to remove them later on.)::

     mount -t nfs myserver.com:/export/testbox-nfsroot /mnt/

   Note! Adding ``-o nfsvers=3`` may help with some NFSv4 servers.

 - Copy the debian root and dev file system onto nfsroot.  If you have ssh
   access to the NFS server, the quickest way to do it is to use ``tar``::

     tar -cz --one-file-system -f /mnt/testbox-maintenance-nfsroot.tar.gz . dev/

   An alternative is ``cp -ax . /mnt/. &&  cp -ax dev/. /mnt/dev/.`` but this
   is quite a bit slower, obviously.

 - Edit ``/etc/ssh/sshd_config`` setting ``PermitRootLogin`` to ``yes`` so we can ssh
   in as root later on.

 - chroot into the nfsroot: ``chroot /mnt/``

     - ``mount -t proc proc /proc``

     - ``mount -t sysfs sysfs /sys``

     - ``mkdir /mnt/testbox-tftp /mnt/testbox-backup``

     - Recreate ``/etc/fstab`` with::

         proc                             /proc               proc  defaults   0 0
         /dev/nfs                         /                   nfs   defaults   1 1
         10.42.1.1:/export/testbox-tftp   /mnt/testbox-tftp   nfs   tcp,nfsvers=3,noauto  2 2
         10.42.1.1:/export/testbox-backup /mnt/testbox-backup nfs   tcp,nfsvers=3,noauto  3 3

       We use NFS version 3 as that works better for our NFS server and client,
       remove if not necessary.  The ``noauto`` option is to work around mount
       trouble during early bootup on some of our boxes.

     - Do ``mount /mnt/testbox-tftp && mount /mnt/testbox-backup`` to mount the
       two shares.  This may be a good time to execute the instructions in the
       sections above relating to these two shares.

     - Edit ``/etc/initramfs-tools/initramfs.conf`` and change the ``MODULES``
       value from ``most`` to ``netboot``.

     - Append ``aufs`` to ``/etc/initramfs-tools/modules``.  The advanced
       multi-layered unification filesystem (aufs) enables us to use a
       read-only NFS root. [3]_ [4]_ [5]_

     - Create ``/etc/initramfs-tools/scripts/init-bottom/00_aufs_init`` as
       an executable file with the following content::

         #!/bin/sh
         # Don't run during update-initramfs:
         case "$1" in
             prereqs)
                 exit 0;
                 ;;
         esac

         modprobe aufs
         mkdir -p /ro /rw /aufs
         mount -t tmpfs tmpfs /rw -o noatime,mode=0755
         mount --move $rootmnt /ro
         mount -t aufs aufs /aufs -o noatime,dirs=/rw:/ro=ro
         mkdir -p /aufs/rw /aufs/ro
         mount --move /ro /aufs/ro
         mount --move /rw /aufs/rw
         mount --move /aufs /root
         exit 0

     - Update the init ramdisk: ``update-initramfs -u -k all``

       Note! It may be necessary to do ``mount -t tmpfs tmpfs /var/tmp`` to help
             this operation succeed.

     - Copy ``/boot`` to ``/mnt/testbox-tftp/maintenance-boot/``.

     - Copy the ``testbox-maintenance.sh`` file found in the same directory as this
       document to ``/root/scripts/`` (need to create the dir) and make it
       executable.

     - Create the systemd service file for the maintenance service as
       ``/etc/systemd/system/testbox-maintenance.service`` with the content::

         [Unit]
         Description=Testbox Maintenance
         After=network.target
         Before=getty@tty1.service

         [Service]
         Type=oneshot
         RemainAfterExit=True
         ExecStart=/root/scripts/testbox-maintenance.sh
         ExecStartPre=/bin/echo -e \033%G
         ExecReload=/bin/kill -HUP $MAINPID
         WorkingDirectory=/tmp
         Environment=TERM=xterm
         StandardOutput=journal+console

         [Install]
         WantedBy=multi-user.target

     - Enable our service: ``systemctl enable /etc/systemd/system/testbox-maintenance.service``

     - xxxx ... more ???

     - Before leaving the chroot, do ``umount /proc /sys /mnt/testbox-*``.


 - Testing the setup from a VM is kind of useful (if the NFS server can be
   convinced to accept root NFS mounts from non-privileged client ports):

     - Create a VM using the 64-bit debian profile.  Let's call it "pxe-vm".
     - Mount the TFTP share somewhere, like M: or /mnt/testbox-tftp.
     - Reconfigure the NAT DHCP and TFTP bits::

         VBoxManage setextradata pxe-vm VBoxInternal/PDM/DriverTransformations/pxe/AboveDriver       NAT
         VBoxManage setextradata pxe-vm VBoxInternal/PDM/DriverTransformations/pxe/Action            mergeconfig
         VBoxManage setextradata pxe-vm VBoxInternal/PDM/DriverTransformations/pxe/Config/TFTPPrefix M:/
         VBoxManage setextradata pxe-vm VBoxInternal/PDM/DriverTransformations/pxe/Config/BootFile   pxelinux.0

     - Create the file ``testbox-tftp/pxelinux.cfg/0A00020F`` containing::

         PATH bios
         DEFAULT maintenance
         LABEL maintenance
           MENU LABEL Maintenance (NFS)
           KERNEL maintenance-boot/vmlinuz-3.16.0-4-amd64
           APPEND initrd=maintenance-boot/initrd.img-3.16.0-4-amd64 ro ip=dhcp aufs=tmpfs \
                  boot=nfs root=/dev/nfs nfsroot=10.42.1.1:/export/testbox-nfsroot
         LABEL local-boot
         LOCALBOOT
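
The file name ``0A00020F`` used above is simply an IPv4 address in PXELINUX's
hexadecimal notation.  The sketch below decodes such a name; ``hex_to_ip`` is a
hypothetical helper for illustration only.  The decoded address, 10.0.2.15, is
what VirtualBox NAT hands to the first guest adapter by default, which is
presumably why the test VM uses this file name.

```shell
#!/bin/sh
# Hypothetical helper: decode an eight digit PXELINUX hex name into dotted decimal.
hex_to_ip() {
    hex=$1
    [ "${#hex}" -eq 8 ] || return 1
    # Cut the string into four two-digit groups and let printf convert each one.
    a=$(printf '%s' "$hex" | cut -c1-2)
    b=$(printf '%s' "$hex" | cut -c3-4)
    c=$(printf '%s' "$hex" | cut -c5-6)
    d=$(printf '%s' "$hex" | cut -c7-8)
    printf '%d.%d.%d.%d\n' "0x$a" "0x$b" "0x$c" "0x$d"
}

hex_to_ip 0A00020F   # prints 10.0.2.15
```

Decoding the names this way is a quick means of telling which testboxes
currently have a maintenance configuration pending in ``pxelinux.cfg/``.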


Troubleshooting
===============

``PXE-E11`` or something like ``No ARP reply``
  You probably have the TFTP and DHCP servers on different machines.  Try moving
  the TFTP server to the same machine as the DHCP server, so the PXE stack won't
  have to do any additional ARP resolving.  Google results suggest that a
  congested network could cause the ARP reply to get lost.  Our suspicion is
  that it might also be related to the PXE stack shipping with the NIC.



-----

.. [1] See http://www.syslinux.org/wiki/index.php?title=Category:Modules
.. [2] See http://www.syslinux.org/wiki/index.php?title=PXELINUX#Configuration
.. [3] See https://en.wikipedia.org/wiki/Aufs
.. [4] See http://shitwefoundout.com/wiki/Diskless_ubuntu
.. [5] See http://debianaddict.com/2012/06/19/diskless-debian-linux-booting-via-dhcppxenfstftp/


-----

:Status: $Id: TestBoxImaging.txt $
:Copyright: Copyright (C) 2010-2020 Oracle Corporation.