[Nix-dev] Monitoring by default

Fri Apr 22 16:08:26 CEST 2016

Thanks for the responses so far!

Let me see...

- I actually agree with Tomas about naming. I know I wrote
"services.monitoring.enable", but I hadn't put a lot of thought into that
sentence; "services.monitoring.prometheus" seems like a better namespace.

- I'd add battery life to the list of things we collect, but I don't have a
laptop to run NixOS on at present, and the "over the years" part of that
suggestion doesn't fit well with Prometheus. That's a problem, since
Prometheus is the only monitoring system I really know. Still, if I
write the monitoring console right, it should be possible to include other
services.

- I think "Generate instructions for enabling it in configuration.nix" is
the consensus so far. Let's do that. One question might be what to do about
already-installed systems, but I suppose I can write a blog post once this
is ready for use.

- I didn't know about systemd-cgtop. Neat.

- I'll try to keep pointless metric collection to a minimum, but some
metrics that look pointless aren't. E.g. GPU memory use; there's really no
call for alerting based on that, ever, but it'd still be useful for
informing purchase decisions. (GPUs tend to swap into main memory, which
won't register.)

The problem with turning on collection only once you have a problem is
that it may turn out you really needed the data from ten hours ago; e.g.
imagine seeing a graph where disk usage takes a sharp spike after the
latest system upgrade. That's trivial for humans to spot, but moderately
difficult for computers. (And it won't happen if we don't collect.)
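
To give a feel for the "moderately difficult for computers" part, here is
a rough sketch of how a sustained jump in disk usage could be picked out of
collected samples. This is only an illustration, not what the monitoring
service would actually run; the function name, the window size and the 10%
threshold are all made up for the example.

```python
def find_step_increase(samples, window=6, min_jump=0.10):
    """Return the index of the first sustained upward step, or None.

    samples:  disk usage as fractions in [0, 1], oldest first.
    window:   samples compared on each side of a candidate step
              (hypothetical default).
    min_jump: how far usage must rise to count (hypothetical default).
    """
    for i in range(window, len(samples) - window + 1):
        before = samples[i - window:i]
        after = samples[i:i + window]
        # Require every sample after the step to sit above every sample
        # before it, so a single-sample blip is not reported.
        if min(after) - max(before) >= min_jump:
            return i
    return None


# Usage steady at 40%, then jumping to 65% (say, after an upgrade).
history = [0.40] * 12 + [0.65] * 12
step_at = find_step_increase(history)
```

Note that this only works because the samples from before the jump are
still around to compare against.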

The data isn't going to make its way off the computer. My main concern is
memory usage... for the moment, I'm aiming for a ballpark figure of 100MB
for the service overall. Of course, it can write metrics out to disk; say
1GB there. Those will be configurable, but there's only so much you can cut
before alerts stop working.

- For "System software too old", what I'm really aiming at is an alert
saying "Hey, you haven't upgraded your system in half a year, and there
are probably some security flaws in there by now." The details are TBD; since
we don't have people watching all the security lists and keeping Nix up to
date, using age-of-installed-channel as a proxy is probably the best we can
do.
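
As a sketch of what the age-of-installed-channel check boils down to,
assuming we can get a timestamp for the installed channel from somewhere
(how to obtain it is one of the TBD details, so it is just a parameter
here). The function name and the 182-day threshold are placeholders for
illustration only.

```python
from datetime import datetime, timedelta, timezone

# "Half a year"; a placeholder value, not a decided threshold.
MAX_AGE = timedelta(days=182)


def channel_is_stale(channel_time, now=None, max_age=MAX_AGE):
    """Return True if the installed channel is older than max_age.

    channel_time: timezone-aware datetime of the installed channel
                  (where this comes from is an open question).
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - channel_time > max_age


# A channel from last September, checked today, would fire the alert.
today = datetime(2016, 4, 22, tzinfo=timezone.utc)
stale = channel_is_stale(datetime(2015, 9, 1, tzinfo=timezone.utc), now=today)
```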

A lot of the things I'd like to alert on, on my own system, would be mildly
questionable on others. I'll try to include config options where it makes
sense to me, but... this sort of detailed discussion is really for the PRs.

- And I don't suppose I can deny being at Google, no, but this isn't a
Google project; I've just ended up using NixOS *everywhere* in my personal
life, so now I want monitoring. ;-)

On Thu, Apr 21, 2016 at 9:41 PM Layus <layus.on at gmail.com> wrote:

> I like the idea too.
>
> It seems to me that distributions really lack metrics collection and
> data analysis.
> For example, it would be nice to have automatic gathering of battery
> usage (charge/discharge/capacity) and easy access to compiled
> historical data like the capacity loss over the years.
>
> I know this is far less ambitious than what you describe, but it would
> be a great entry point for new users.
> If they like it, they may want to gather more statistics.
>
> Anyway, I think it makes total sense to have such a feature in NixOS.
>
> -- Layus.
>
> On 20/04/16 11:49, Rok Garbas wrote:
> > +1 for the initiative. i personally don't believe enabling monitoring
> > by default is the right way to go (since we all use nixos in
> > different contexts), but having commented instructions in the generated
> > configuration.nix would be the way to go.
> >
> > it would be nice if systemd monitoring stuff could be used as well:
> > https://github.com/garbas/dotfiles/blob/master/nixos/rok.nix#L236
> > the above line makes systemd-cgtop show numbers.
> >
> >
> >
> > On Wed, Apr 20, 2016 at 8:44 AM, Alexei Robyn <shados at shados.net> wrote:
> >> Seems interesting. You mention alerts for "System software too old.",
> >> but the only vaguely-universal definition of "too old" I can think of
> >> would be "missing security updates", and that's both debatable and an
> >> area where NixOS is currently fairly lacking in infrastructure and
> >> tooling.
> >>
> >> Default collection of metrics beyond what is necessary to provide useful
> >> alerts is a bad idea. Alerts have essentially universal usefulness,
> >> statistics less so - they're unnecessary for most desktops and small
> >> servers. At least until you have issues, so of course it'd be nice if
> >> they were easy to switch on :p.
> >>
> >> - Alexei
> >>
> >>
> >> On Wed, Apr 20, 2016, at 12:40 AM, Svein Ove Aas wrote:
> >>
> >> Hi all,
> >>
> >> People who are not interested in reliability or monitoring can stop
> >> reading now.
> >>
> >> --
> >>
> >> I've written up a "design doc" (statement of intent?) for how we might
> >> do monitoring-by-default. Once I think there is a reasonable level of
> >> consensus about how we should do this, I'll go ahead and implement
> >> what's in the document, but I'd like to make sure we're all on the same
> >> page first; especially as I want this to be on by default.
> >>
> >> So I'd like your input. Can you take a look?
> >>
> >> --
> >> Svein Ove Aas
> >> _______________________________________________
> >> nix-dev mailing list
> >> nix-dev at lists.science.uu.nl
> >> http://lists.science.uu.nl/mailman/listinfo/nix-dev