[Nix-dev] [RFC] Declarative Virtual Machines
Leo Gaspard
leo at gaspard.io
Sun Apr 23 19:44:50 CEST 2017
OK, so, Nadrieril and I talked more about it, and here is what we came
up with.
The basic requirement is that the host be able to control the guest's
boot. Otherwise, if the guest's config is not synchronized with what the
host expects, the guest may not even be able to boot properly if, say,
they don't agree on the network setup.
We saw three ways to achieve this:
1/ Having the guest boot, then wait in the initrd for its
configuration, have the host push it, then the guest can continue
booting. This requires nothing special on the guest's FS.
2/ Having the guest's /nix on a separate .qcow2 image. This way, the
host can decide to stop the guest, upgrade the store, then restart the
guest.
3/ Having the guest's /nix on a virtfs. This way, the host can upgrade
the guest's store in-place.
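To make option 1 a bit more concrete, here is a rough sketch of the
guest-side wait. Note that this is purely illustrative: the function
name, the share location, and the "system-ready" marker file are all
made up, not an existing NixOS mechanism.

```shell
#!/bin/sh
# Hypothetical initrd hook: block until the host has pushed the guest's
# configuration into a host-writable shared location, then hand over to it.
wait_for_host_config() {
  conf_dir=$1                     # e.g. /mnt/host-config (illustrative path)
  while [ ! -e "$conf_dir/system-ready" ]; do
    sleep 1                       # poll until the host signals the push is done
  done
}

# In a real initrd this would then be followed by something like:
#   wait_for_host_config /mnt/host-config
#   exec /mnt/host-config/system/init
```

The marker file is written by the host last, so the guest never races a
half-pushed configuration.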
There are drawbacks to all of these options:
1/ makes the boot scheme complex to understand, and risks duplicating
some behaviour between the dropbear in the initrd and the ssh daemon
outside of it, in order to handle online upgrades
2/ makes it really complex to handle online upgrades
3/ appears to be slower and less stable
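On 3/, note that 9p throughput depends heavily on the msize mount
option; the default is small, and raising it helps a lot when 9p runs
over virtio rather than a real network. A guest-side mount command for
such a setup might look like this (the `nix-store` mount tag is
illustrative; it has to match whatever tag the host passes to qemu):

```shell
mount -t 9p -o trans=virtio,version=9p2000.L,msize=262144 nix-store /nix/store
```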
Do you see a simpler setup that would work as we want? If not, which of
these seems most reasonable to you?
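For reference, the kind of offline upgrade cycle option 2 implies could
be sketched roughly as below. This is only a command-sequence sketch:
the unit name, paths, and per-VM profile are made up, and
`nix copy --to "local?root=…"` assumes a recent Nix.

```shell
#!/bin/sh
set -e
VM=myguest                                # illustrative VM name
IMG=/var/lib/vm/$VM/nix.qcow2             # the guest's /nix on its own qcow2
NEW_SYSTEM=$(readlink -f "/nix/var/nix/profiles/per-vm/$VM/system")  # hypothetical profile

systemctl stop "vm@$VM"                   # 1. stop the guest

modprobe nbd                              # 2. expose the image as a block device
qemu-nbd --connect=/dev/nbd0 "$IMG"
mount /dev/nbd0 /mnt/guest-nix

# 3. copy the new system closure into the guest's store
nix copy --to "local?root=/mnt/guest-nix" "$NEW_SYSTEM"

umount /mnt/guest-nix                     # 4. unmount and restart the guest,
qemu-nbd --disconnect /dev/nbd0           #    which boots into the new closure
systemctl start "vm@$VM"
```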
On 04/23/2017 02:40 PM, Leo Gaspard wrote:
> On 04/22/2017 11:07 PM, Volth wrote:
>> Hello.
>>
>> There are few objections against qemu with shared /nix/store:
>>
>> 1. It is fast to create but slow to run. Boot time with a shared
>> /nix/store is about twice as slow as with everything on qcow2.
>>
>> 2. 9P is unstable; every couple of months there is a new bug (real
>> bugs, not CVEs: wrong data read, the driver getting stuck, etc.)
>
> Hmm, I wasn't aware of these first two points (I didn't test
> anything); intuitively, virtfs was supposed to be faster in my mind,
> as it skips one level of parsing the qcow2 image. I guess I should
> have actually tested instead of relying on gut feeling. That said, I'd
> like to ask whether you set the msize=262144 (or a similarly high
> value) option for the 9p mount on the guest during these benchmarks?
> It greatly influences performance when 9p is not used over a network,
> as it was originally designed to be.
>
> As for stability, in my test setup I haven't hit any non-permanent
> issue (things like being unable to chown / in mapped-file mode did
> appear, but nothing that would compromise the stability of a
> production system, since it only shows up during development), so I
> assumed it was pretty stable.
>
>> 3. host GC cannot see the runtime roots inside the VM, so all the
>> guest system closures from its last boot should be preserved from host
>> GC. It may be tricky to debug.
>
> This is not really an issue, as the store is not shared with the
> guest: the guest only sees an rsync'd copy of the part of the store
> that interests it (in order to avoid information leaks). So the guest
> never actually sees the host store.
>
> The reason for picking 9p instead of qcow2 to hold this copy of the
> store was to allow upgrading the VM without rebooting it (as the VM
> doesn't have access to its configuration, it can't just perform the
> upgrade from the inside). So I thought future work might include
> having the host rsync the relevant files into the 9p export path, and
> then push a bash script to a shared place that the guest would execute
> as root from a cron job, triggering a call to the new profile's
> switch-to-configuration.
>
> This would also be possible with the store on a qcow2 image, but it
> would entail also pushing all the store paths through this shared path
> and having the guest copy them into its nix store. I guess it's
> possible and doesn't involve many drawbacks, except a noticeably
> increased time-to-upgrade due to the two copies instead of one.
>
> Among the downsides of using qcow2, I can see that with a CoW FS (such
> as btrfs) shared between /nix/store and /var/lib/vm/${vmname}/store,
> it's possible to have the guest's store take zero additional space,
> while a qcow2 image makes that much harder (and I don't think any
> widely used FS performs block-level deduplication, but I may be wrong)
>
> So I'd love to hear other voices before switching from one to the other,
> as I'm pretty sure we're missing some other decision points.
>
>> Also, the whole idea could be split into simpler building blocks and
>> generalized for use with VirtualBox and different kinds of containers.
>> One of the blocks could be, say, a "nix-slave": a NixOS install which
>> is always configured on an external machine and then run inside a VM
>> or container, or deployed to the cloud.
>> So it cannot do "nixos-rebuild" from the inside and has a limited set
>> of features: no profiles (no need to "boot the previous version" if
>> the previous version can be written to the .qcow2 of a powered-off
>> VM), no "nix-env", etc.
>> Then, a tool to make a container/VM out of a configuration.
>> Then, a VM-agnostic tool to configure the network of those slaves.
>>
>> Well, it sounds very familiar.
>> We indeed have this pattern in so many places: NixOS containers,
>> NixOps, the test driver, "nixos-rebuild build-vm", runInLinuxVM,
>> make-disk-image.nix, your proposal, etc.
>> Each of them solves one narrow task, and the code is not reusable.
>> For example, when I need to create a .qcow2 outside the nix store, or
>> install/repair NixOS on an existing .qcow2, I end up writing my own
>> set of tools (or using RedHat's libguestfs, which is... another VM
>> appliance)
>> Perhaps there could be some common ground which unifies those kinds
>> of tasks, as an alternative to creating new bloated tools with many
>> options?
>
> I see you have already seen it, but just for the record, copumpkin has
> recently done great work in this domain with nixos-prepare-root [1]
> (it's newly merged, so I didn't use it in my not-yet-PR'd changes, but
> it's on my todo list before opening the PR related to this RFC)
>
> This looks like exactly what you're looking for, except that it still
> requires copying the generated root from a local directory to the
> right block device, which can in any case only be done in a way that
> heavily depends on which block device it is. It would be possible to
> do a make-disk-image as you suggested in the comments to [2], but I
> don't think it would fit inside the scope of this RFC; it belongs
> rather in a nixpkgs refactoring (which AFAIU doesn't require an RFC).
>
> Or did I miss your point here?
>
>
> [1] https://github.com/NixOS/nixpkgs/pull/23026
>
> [2] https://github.com/NixOS/nixpkgs/pull/24964
>
>
>
> _______________________________________________
> nix-dev mailing list
> nix-dev at lists.science.uu.nl
> https://mailman.science.uu.nl/mailman/listinfo/nix-dev
>