LibrePlanet discussion list archive (unofficial mirror)
* Machine learning and copyleft
@ 2016-12-10  0:42 Amias Hartley
From: Amias Hartley @ 2016-12-10  0:42 UTC
  To: libreplanet-discuss


Let's consider a machine learning system consisting of two parts:

1. Training program. It takes a dataset and produces a trained model,
usually stored as a few serialized arrays of floating-point numbers.
2. Inference program. It takes a pre-trained model and some input data
and produces output based on them (a minimal sketch of both parts
follows this list).
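
For concreteness, here is a minimal sketch of both parts in Python. The
model choice (logistic regression), the file name, and the .npz
serialization are my own illustrative assumptions, not part of the
scenario:

    import numpy as np

    def train(X, y, lr=0.1, steps=1000):
        """Training program: dataset (X, y) in, serialized weights out."""
        w, b = np.zeros(X.shape[1]), 0.0
        for _ in range(steps):
            p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
            w -= lr * X.T @ (p - y) / len(y)        # gradient step, weights
            b -= lr * float(np.mean(p - y))         # gradient step, bias
        np.savez("model.npz", w=w, b=b)             # a few float arrays

    def infer(x):
        """Inference program: pre-trained model plus input in, output out."""
        m = np.load("model.npz")
        return 1.0 / (1.0 + np.exp(-(x @ m["w"] + m["b"])))

Note that infer() runs perfectly well without train() or the dataset;
that separation is exactly what makes the loophole described below
possible.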

Both the training and inference programs are free software licensed
under the GPL.

Let's also suppose that there is a model that is the result of running
the training program on some publicly available dataset. This dataset
(for example, a set of labeled photographs or natural-language texts) is
permitted by its publisher to be used without limitation for training
machine learning models, and use of the trained models is not restricted
by the dataset publisher either.

So the entire system is distributed as: the training program with
sources, the inference program with sources, the training dataset, and
the trained model.

Someone could take this system, modify the training program, and train a
new model on the same dataset. They could then publish only the
inference program with sources, the unmodified training dataset, and the
new trained model. The end user doesn't need the modified training
program to run the inference program with the new model, so it is never
distributed: technically, its only user is the person who trained the
new model with it, and the GPL therefore doesn't require its
distribution.

However, in this case the freedom of the users of the distributed system
(the inference program and the new model) is violated: they can't
retrain the model on new data, nor improve the training code and retrain
on the same data to improve the model's performance.

My question is: how can users' freedom be protected by requiring
everyone who distributes a trained model to also distribute the sources
of the training program used to train it, along with instructions for
obtaining the training dataset?

Could the problem be solved by the GPL, or is the GPL not enough for
this case? If not, is there a license that provides the required
guarantees? I'd like to note that, by the definition of the problem, the
dataset is published by a third party; while it can be used without
restriction for any machine learning task, it cannot be relicensed, and
in any case the guarantee of access to the modified sources of the
training program should hold even if some other dataset is used to train
the new model instead of the original one.

P.S. It's an interesting question whether model weights can be
considered software or not. Some machine learning models can in theory
encode arbitrary logic, such as neural Turing machines
(https://arxiv.org/abs/1410.5401); others, such as convolutional neural
networks (https://en.wikipedia.org/wiki/Convolutional_neural_network),
are more limited in their capabilities but very expressive in practice;
and still others, for example logistic regression, are much more
limited. It is desirable to have a way to protect users' freedom
regardless of the complexity of a particular model.
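
To illustrate the simplest end of that spectrum, in logistic regression
the weights only parameterize one fixed arithmetic formula, leaving
little room to encode arbitrary logic. A small sketch (the numbers are
arbitrary):

    import math

    w, b = [0.8, -1.2], 0.5   # example weights: just numbers
    x = [1.0, 2.0]            # example input

    # The model's entire "behavior" is this one formula:
    p = 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
    print(p)                  # a single probability, about 0.25 here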



* Re: Machine learning and copyleft
@ 2016-12-11 19:18 Richard Stallman
From: Richard Stallman @ 2016-12-11 19:18 UTC
  To: libreplanet-discuss; +Cc: Amias Hartley

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

Looking at this scenario, my conclusion is that the training program
is effectively a compiler: the training data set is the source code it
compiles, and the trained model is object code that it produces.
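
Schematically (a toy restatement of the analogy, with made-up names,
not anyone's real build or training commands):

    # Both pipelines map a human-editable input to an opaque artifact.
    def compile_program(source_code: str) -> bytes:
        return source_code.encode()            # stands in for object code

    def train_model(dataset: list) -> list:
        return [sum(dataset) / len(dataset)]   # stands in for real weights

    object_code   = compile_program("int main() { return 0; }")
    trained_model = train_model([1.0, 2.0, 3.0])

In neither case is the output the preferred form for modification; for
that you need the input and the sources of the tool that produced it.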

Thus, I agree that releasing the trained model built with a private
modified version of the training program is unethical.  It's a compiled
program, released without sources.

I don't think it makes sense to try to prevent this problem by
changing the license of either the training program or the inference
program.  That would be comparable to licensing an interpreter so that
it can only be used to run free programs -- it wouldn't be wise, and
(from what lawyers have told me) is not lawful use of copyright in the
US.

-- 
Dr Richard Stallman
President, Free Software Foundation (gnu.org, fsf.org)
Internet Hall-of-Famer (internethalloffame.org)
Skype: No way! See stallman.org/skype.html.



