Proposal and Request For Feedback: Implement `dnf countme`

16 Dec 2021

Hello, I am Jonathan Wright, Infrastructure Team Lead for AlmaLinux. I
manage most of the plumbing that keeps things humming along smoothly, and
I’ve been working on improvements to parts of it to make things more
user-friendly for our community.

AlmaLinux values transparency <https://wiki.almalinux.org/Transparency.html>
and communal decision making; it’s one of the reasons I decided to
become a contributor. As part of the work I’m doing, I’d like to request
feedback from the community on a proposal to enable `dnf countme`,
similar to the way the Fedora project does.

countme is a core feature of DNF, implemented upstream in Fedora 32 (dnf
4.2.9). The DNF documentation describes it as follows:

Determines whether a special flag should be added to a single, randomly
chosen metalink/mirrorlist query each week. This allows the repository
owner to estimate the number of systems consuming it, by counting such
queries over a week's time, which is much more accurate than just counting
unique IP addresses (which is subject to both overcounting and
undercounting due to short DHCP leases and NAT, respectively).

The flag is a simple "countme=N" parameter appended to the metalink and
mirrorlist URL, where N is an integer representing the "longevity" bucket
this system belongs to. The following 4 buckets are defined, based on how
many full weeks have passed since the beginning of the week when this
system was installed: 1 = first week, 2 = first month (2-4 weeks), 3 = six
months (5-24 weeks) and 4 = more than six months (> 24 weeks). This
information is meant to help distinguish short-lived installs from
long-term ones, and to gather other statistics about system lifecycle.
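To make the bucket boundaries concrete, here is a minimal Python sketch of
the mapping described in the quoted documentation. It is an illustration
only, not libdnf's actual code; the function name and its argument are made
up for this example, and it assumes the install week counts as week 1.

```python
def countme_bucket(full_weeks_since_install: int) -> int:
    """Map full weeks elapsed since the install week began to a bucket 1-4."""
    if full_weeks_since_install < 0:
        raise ValueError("weeks since install cannot be negative")
    # Assumption: the week the system was installed in is week 1.
    week_number = full_weeks_since_install + 1
    if week_number == 1:
        return 1  # first week
    if week_number <= 4:
        return 2  # first month (weeks 2-4)
    if week_number <= 24:
        return 3  # six months (weeks 5-24)
    return 4  # more than six months (> 24 weeks)


if __name__ == "__main__":
    for weeks in (0, 1, 3, 4, 23, 24, 52):
        print(f"{weeks} full weeks -> countme={countme_bucket(weeks)}")
```

Note that the value sent is only the bucket number, so two systems installed
months apart can report the same value.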

countme was designed with privacy in mind and does not add any identifying
or unique information to requests, so there is no tracking involved. It is
just a simple “hello” to the repository.

Currently, AlmaLinux does not track any usage statistics for our
distribution. We could try to aggregate basic metrics from HTTP logs on our
mirrorlist servers, but the data would be unreliable, since counting unique
IPs is undermined by things like NAT and dynamic addressing. So, I’d like
to propose that we set “countme=1” in our repository configs, just as
Fedora and EPEL have done. I’d also like to propose that the aggregated
data be made publicly available, similar to
https://data-analysis.fedoraproject.org/, for the community to see.
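For anyone unsure what this looks like in practice, the sketch below shows
a repository stanza with the option set and a small Python check that reads
it. The repo id, name, and mirrorlist URL are placeholders for this
example, not the contents of our shipped files, and the helper function is
hypothetical.

```python
import configparser

# Placeholder stanza: repo id, name, and URL are illustrative assumptions.
REPO_TEXT = """\
[baseos]
name=AlmaLinux $releasever - BaseOS
mirrorlist=https://mirrors.example.org/mirrorlist/$releasever/baseos
enabled=1
countme=1
"""


def countme_enabled(repo_text: str, repo_id: str) -> bool:
    """Return True if the given repo section sets countme to a true value."""
    # interpolation=None keeps values such as $releasever untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(repo_text)
    return parser.getboolean(repo_id, "countme", fallback=False)


if __name__ == "__main__":
    print(countme_enabled(REPO_TEXT, "baseos"))
```

A stanza without a countme line is reported as disabled by this check.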

I’ve set up a form for feedback at https://forms.gle/BShXoxJmsjNbMXCk6 in
case you’d like to give any input on this proposal. We will keep this form
open for about a week.

FAQ:

Q: When are “countme” requests sent?
A: Once a week, at a random point during normal dnf activity. If you do not
run dnf commands that would otherwise trigger mirrorlist requests
(makecache, install, update), this flag will NOT cause dnf to go out of its
way and make special requests.

Q: What extra data will be sent that is not currently collected?
A: “countme=X” will be added to one random mirrorlist request from DNF each
week, where X is a number from 1 to 4 indicating the “longevity” bucket
your system falls into (roughly how long it has been installed, not an
exact number of weeks). See above for the explanation of this from the DNF
documentation.

Q: Will aggregated data be made publicly available?
A: Yes.

Q: What data do you use?
A: The only data we look at is in the HTTP request itself. Our log lines
are in the standard Combined Log Format.  Ex:
172.30.61.81 - - [15/Dec/2021:17:02:12 +0000] "GET
/mirrorlist/8/baseos?countme=4 HTTP/1.1" 200 629 "-" "libdnf (AlmaLinux
8.3; generic; Linux.x86_64)"

We only look at log lines where the request is "GET", the query string
includes "countme=N", the result is 200 or 302, and the User-Agent string
matches the libdnf User-Agent header.

The only data we use are the timestamp, the query parameters (repo, arch,
countme), and the libdnf User-Agent data.
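As a rough illustration of the filtering described above, here is a minimal
Python sketch that parses a Combined Log Format line and keeps only the
fields we use. It is not our production parser; the regular expression, the
function name, and the assumption that N must be 1 through 4 are all made up
for this example. In the sample line the repo appears in the request path,
so the sketch returns the path alongside the countme value.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Combined Log Format: host ident user [time] "request" status size "referer" "agent"
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<target>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_countme(line: str):
    """Return the fields used for counting, or None if the line does not qualify."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    if m["method"] != "GET" or m["status"] not in ("200", "302"):
        return None
    if not m["agent"].startswith("libdnf"):
        return None
    url = urlsplit(m["target"])
    countme = parse_qs(url.query).get("countme", [""])[0]
    if countme not in ("1", "2", "3", "4"):  # assumption: only buckets 1-4 count
        return None
    # The client IP is deliberately not part of the returned record.
    return {
        "timestamp": m["time"],
        "path": url.path,
        "countme": int(countme),
        "user_agent": m["agent"],
    }


SAMPLE = (
    '172.30.61.81 - - [15/Dec/2021:17:02:12 +0000] '
    '"GET /mirrorlist/8/baseos?countme=4 HTTP/1.1" 200 629 "-" '
    '"libdnf (AlmaLinux 8.3; generic; Linux.x86_64)"'
)

if __name__ == "__main__":
    print(parse_countme(SAMPLE))
```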

In the future, we will also aggregate data by country using GeoIP. Our
processing and aggregation do not care about IPs themselves or their
uniqueness. When we implement the aggregation of geographic data, it will
use MaxMind’s GeoIP database locally to turn the IP into a region, which
will be used for tallying generalized metrics for that region.

Raw access logs are archived so that, if we find major issues in any of our
processing, we can re-parse the data in the future and correct the
published statistics.

Q: Can I opt out?
A: Yes, but we’d prefer you didn’t, since the data is very helpful. The
only extra data you’ll be submitting is “countme=X” in one request per week.

If you’d like to opt out, you can comment out the “countme=1” line in the
repository config files in /etc/yum.repos.d/.
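If you manage many machines, the opt-out edit can be scripted. The following
is a minimal Python sketch that comments out the line in the text of a repo
file. The function name and the sample stanza are hypothetical; it works on
a string only, so reading and writing the actual files is left to you.

```python
import re

# Matches a line consisting only of "countme=1", allowing spaces or tabs.
COUNTME_LINE = re.compile(r"^([ \t]*)(countme[ \t]*=[ \t]*1[ \t]*)$", re.MULTILINE)


def comment_out_countme(repo_text: str) -> str:
    """Prefix every active countme=1 line with '#', leaving other lines alone."""
    return COUNTME_LINE.sub(r"\1#\2", repo_text)


if __name__ == "__main__":
    # Hypothetical stanza for demonstration only.
    sample = "[baseos]\nenabled=1\ncountme=1\n"
    print(comment_out_countme(sample))
```

Running it twice is harmless, since a line that already starts with “#” no
longer matches.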

Discussion for this should be directed to the AlmaLinux Infrastructure
mailing list. You can join the list at
https://lists.almalinux.org/mailman3/lists/infra.lists.almalinux.org/

-- 
Jonathan Wright
AlmaLinux Foundation
Mattermost: chat <https://chat.almalinux.org/almalinux/messages/@jonathan>
