[Nix-dev] Proposal for hydra with librsync
Jeff Johnson
n3npq at mac.com
Tue May 11 14:19:59 CEST 2010
(aside)
Apologies for my English, which is peculiar to me. Here's
a more complete explanation of what I tried to say.
On May 11, 2010, at 4:56 AM, Lluís Batlle i Rossell wrote:
> On Mon, May 10, 2010 at 06:48:52PM -0400, Jeff Johnson wrote:
>>
>> On May 10, 2010, at 6:32 PM, Lluís Batlle i Rossell wrote:
>>>
>>> I think that such a system, additionally to the rest of the nix system, would
>>> look astonishing compared to what other distros have.
>>>
>>
>> (aside)
>> Distributing with rsync is hardly state of the art, /nix/store _IS_
>> state of the art. I wouldn't bother with rsync "astonishment".
>
> I did not know a softer English word. :) Nevertheless, I wanted to point the
> rsync for binary distribution, not for source distribution as gentoo or MacOSX.
> Additionally, in our case we have quite enough more updates, and a client may
> have a quite similar file base; a dependency change may involve an update in
> lots of packages, and by now we have to reload them all.
>
Using an rsync-like transport (instead of, say, replacing *.gz with *.bz2 or *.xz)
will save bandwidth. That is the better engineering reason for using rsync, not
that it's better than what other distros are doing. Saving bandwidth
with proper engineering is impressive and "astonishing" per se, not in
comparison to other distros. I hope that explains what I attempted to say.
Using rsync-like protocols for transport is a very good idea.
> I did not do much tests with the rsync; I tried with kdelibs-4.4.3 yesterday
> (with the difference caused by a change in the openssl dependency); the tar.bz2
> of the store path amounted to 19MB, while rsync -zc could transfer the new
> having the old with only 4MB. That, without closures - using directly the store path
> directories. Maybe bsdiff would have done that with 200KB, but with much more
> memory.
>
Testing is needed because the issues with --rsyncable are subtle. The Huffman
dictionary used by gzip/zlib (and similar in other compressions) needs to be written
into the compressed file deterministically for rsync-like protocols to be able
to generate deltas efficiently. There are two approaches (i.e. what gzip --rsyncable
implements): 1) padding to rsync block boundaries 2) calling zflush at known points.
The subtle problems are
1) padding introduces ~1% overhead, and flushing adds ~0.1%, to the size of
the compressed files. So it looks like you are increasing, not decreasing,
the size of the compressed files.
2) rsync deltas (just like all deltas) are between _BEFORE_ and _AFTER_
images. So --rsyncable needs to be done on BOTH images for rsync to really
save bandwidth. Otherwise the transfer will still "work", but it degrades
to sub-optimal deltafication.
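
(aside) A minimal Python sketch of the "flush at known points" idea, to show
why it helps. This is NOT gzip's --rsyncable code: the rolling-sum rule and
the WINDOW and MODULUS values are assumptions picked for illustration, and the
flush points here are far more frequent than a real tool would use. The
compressor is reset whenever a sum over the last few input bytes hits a
multiple of MODULUS, so the flush points follow the content rather than the
byte offset.

```python
import zlib

WINDOW = 32     # bytes covered by the rolling sum (illustrative value)
MODULUS = 256   # smaller modulus = more frequent flush points (illustrative value)

def rsyncable_deflate(data: bytes) -> bytes:
    """Raw deflate with a full flush at content-defined points."""
    # Negative wbits selects raw deflate: no header and no checksum trailer.
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)
    out = bytearray()
    rolling = 0
    start = 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= WINDOW:
            rolling -= data[i - WINDOW]   # drop the byte that left the window
        if i >= WINDOW - 1 and rolling % MODULUS == 0:
            out += comp.compress(data[start:i + 1])
            # Z_FULL_FLUSH resets the compressor state, so what follows
            # depends only on the input that follows.
            out += comp.flush(zlib.Z_FULL_FLUSH)
            start = i + 1
    out += comp.compress(data[start:])
    out += comp.flush()
    return bytes(out)

def common_suffix_len(a: bytes, b: bytes) -> int:
    """Number of trailing bytes that two byte strings share."""
    n = 0
    while n < len(a) and n < len(b) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n
```

Compressing a _BEFORE_ image and an _AFTER_ image that differ only near the
start should give two outputs that agree again after the first shared flush
point, which is what an rsync-like delta needs. Both images have to be
compressed this way, as in 2) above.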
> We can think of using Xdelta,
> http://code.google.com/p/xdelta/wiki/TuningMemoryBudget, which can work with a
> limited memory budget (not like bsdiff), and may provide even better reduction
> of transfer.
>
rsync/zsync is unique (imho) because all other deltafication schemes (xdelta,
or bsdiff/bspatch or Google Courgette) need the _BEFORE_ and _AFTER_ images
to be co-resident on the same machine. Only rsync/zsync use a 3-step protocol:
1) client: generate block checksums of one image and send to server
2) server: use the block checksums and the other image to generate patch and send to client
3) client: apply patch
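
(aside) The three steps above can be sketched in Python roughly as follows.
This is a toy under stated assumptions, not librsync or the rsync wire format:
the function names, the tiny BLOCK size, and the Adler-32 plus MD5 checksum
pair are all chosen for illustration. Real tools use a rolling weak checksum
so that the byte-by-byte scan in step 2 stays cheap.

```python
import hashlib
import zlib

BLOCK = 4  # tiny block size for illustration; real tools use much larger blocks

def signature(old: bytes, block: int = BLOCK) -> dict:
    """Step 1 (client): checksums of each block of the image it already has."""
    sigs = {}
    for i in range(0, len(old), block):
        chunk = old[i:i + block]
        entry = (hashlib.md5(chunk).digest(), i // block)
        sigs.setdefault(zlib.adler32(chunk), []).append(entry)
    return sigs

def delta(new: bytes, sigs: dict, block: int = BLOCK) -> list:
    """Step 2 (server): describe the other image as block copies plus literal bytes."""
    ops, literal, i = [], bytearray(), 0
    while i < len(new):
        chunk = new[i:i + block]
        match = None
        if len(chunk) == block:
            strong = hashlib.md5(chunk).digest()
            # Weak checksum narrows the candidates, strong checksum confirms.
            for cand_strong, idx in sigs.get(zlib.adler32(chunk), []):
                if cand_strong == strong:
                    match = idx
                    break
        if match is not None:
            if literal:
                ops.append(("data", bytes(literal)))
                literal = bytearray()
            ops.append(("copy", match))
            i += block
        else:
            literal.append(new[i])  # no block matches here, slide by one byte
            i += 1
    if literal:
        ops.append(("data", bytes(literal)))
    return ops

def patch(old: bytes, ops: list, block: int = BLOCK) -> bytes:
    """Step 3 (client): rebuild the other image from its own blocks and the delta."""
    out = bytearray()
    for kind, arg in ops:
        if kind == "copy":
            out += old[arg * block:(arg + 1) * block]
        else:
            out += arg
    return bytes(out)
```

Only the signature (client to server) and the delta (server to client) cross
the wire, so neither side ever needs both images at once.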
>> librsync is a zombie project. which is a shame, Martin Pool is a
>> really smart guy, just that the rsync protocol is patent encumbered
>> many years now.
>
> I did not know about the patent problems.
>
The issues are quite old. But in 1998 it was known that ~87%
of the bandwidth needed for displaying web pages could be saved
with an rsync-like protocol implemented in browsers. You would
think that saving 87% of the bandwidth used displaying web pages
would be a useful implementation. The implementation has never
happened because of the patent. Sadly the patent has never
produced a useful product afaik.
>> There's zsync which _IS_ active, and also has the benefit of
>> burning client, not server, CPU cycles. But I haven't looked
>> carefully at zsync for several years.
> worth a try.
>
>> For extra credit: Figger how to remove rsync's innate bias
>> on file paths so that Nix instances could be rsync'd directly.
>> Now _THAT_ would be spiffy and "astonishing".
> 'nix instances'? 'innate bias on file paths'? I don't know what do you talk
> about. Can you explain a bit?
Ah sorry. Here's an example from distributing RPM packages using rsync.
RPM package file names almost always have versioning information in the name:
foo-1.2-3.i386.rpm
When a new "foo-1.2-4.i386.rpm" is produced, then rsync between server <-> client
uses the path as a hint to _BEFORE_ and _AFTER_ images. All I'm describing is "mirroring"
here.
So (in the naive case) rsync cannot detect that "foo-1.2-3.i386.rpm" is a useful _BEFORE_
image when attempting to transport "foo-1.2-4.i386.rpm" efficiently with deltafication,
and the transport becomes just a full copy.
(aside) there's a "fuzzy" patch in rsync, basically choosing the most similar (according
to a soundex-like name comparison) path to use as a _BEFORE_ image for deltafication,
which "works" for simple problems like RPM packages with a version in the name.
But when there is a hash (the "nix instance") as part of the path, then simple
similarity comparisons like SOUNDEX just will not do, and rsync would have
difficulty identifying the most similar _BEFORE_ and _AFTER_ images needed
for deltafication.
So for Nix a different type of similarity comparison would need to be constructed
in order for rsync deltafication to "work" properly.
I hope that explains better.
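
(aside) One possible shape for such a comparison, sketched in Python. It
assumes a store path basename looks like "<32-character hash>-<name>" and
simply ignores the hash, comparing only the name part. The function names are
made up for this example, and name similarity alone may well be too weak for
real use; it is only meant to show where a Nix-aware heuristic would plug in.

```python
import difflib
import posixpath

HASH_LEN = 32  # assumed length of the hash prefix in a store path basename

def package_name(store_path: str) -> str:
    """Return the basename with the leading "<hash>-" removed, if present."""
    base = posixpath.basename(store_path.rstrip("/"))
    if len(base) > HASH_LEN and base[HASH_LEN] == "-":
        return base[HASH_LEN + 1:]
    return base

def best_before_image(after_path: str, candidates: list):
    """Pick the local path whose name is most similar to the wanted one."""
    target = package_name(after_path)
    scored = [
        (difflib.SequenceMatcher(None, target, package_name(c)).ratio(), c)
        for c in candidates
    ]
    return max(scored)[1] if scored else None
```

With this, a client that holds some path named kdelibs-4.4.2 would offer it as
the _BEFORE_ image when asked to fetch a path named kdelibs-4.4.3, whatever
the two hashes are.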
> I don't mean, specially, to improve the rsync program. I'm interested in nix
> closures distribution, specially by the build farm.
>
BTW, the "fuzzy" patch (or whatever rsync heuristic is used to identify
_BEFORE_ and _AFTER_ images based on path) isn't hard and should be examined.
> About implementing... I'd like to implement a solution at least for
> 'nix-copy-closure' - whenever I find any time, I'll go with it, with whatever
> delta system I find best.
>
It sounds like you already understand the rsync issue that needs to be solved:
basically, "closures", not paths, need to be used for similarity.
But if "closures" are unique, then you have the same problem again:
rsync needs _BEFORE_ and _AFTER_ images that are "similar"
(whatever that means) for efficient deltafication.
hth
And my apologies for the jargon in my 1st post.
73 de Jeff