Language growing
Language is the exchange of arbitrary probabilistic evidence. What constraints does this impose on the structure of language?
There is a vague cluster of ideas about the mathematical modeling of language that I've been wanting to post about for a while. It was, in fact, the second thing I promised to write about on my half-dead blog (the first being the theorem marketplace idea). More than a year later, I still haven't posted anything, because I've been unsatisfied with the shape of this idea (unlike the theorem marketplace, where the basic foundation is done and what's left is to build everything else on top of it). Since I still haven't made much progress, I thought: what the hell, I'll just write down my ideas in their half-baked state, and anyone is welcome to contribute.
***
The reason language fascinates me is that it's a way of localizing arbitrary hypotheses in our world models.
I have a probabilistic view of language. I perceive hearing an utterance as observing a piece of evidence that promotes some hypotheses and demotes others in the mixture of hypotheses that is the listener's Bayesian model of the world. Consequently, saying something to someone means trying to update their model of the world by causing them to observe probabilistic evidence.
Textbook explanations of Bayes' theorem often use the example of medical tests. You have some prior probability of having a certain disease, P(A) - say, 5%. Then you take a test, and it comes up positive. The test comes up positive, say, five times more often in people with the disease than in the general population: P(B|A) / P(B) = 5. So you update your probability of being sick from 5% to 5% × 5 = 25%.
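In code, that update is a single multiplication (a minimal sketch of the arithmetic above; the variable names are mine):

```python
# Bayesian update for the medical-test example above.
prior = 0.05          # P(A): prior probability of having the disease
evidence_ratio = 5.0  # P(B|A) / P(B): the test comes up positive five times
                      # more often in the sick than in the general population

posterior = prior * evidence_ratio  # Bayes: P(A|B) = P(A) * P(B|A) / P(B)
print(posterior)  # 0.25: the probability goes from 5% to 25%
```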
The thing is, observing a doctor say that you have this disease is, in some way, the same as observing a test come up positive. You have a prior probability; you receive a piece of evidence; you compare how often this evidence is encountered in worlds where the hypothesis is or isn’t true; you update the probability for the hypothesis in your world model accordingly.
The point of trying to communicate information X to someone is making them observe something that they would only observe in a world where X is true, and not in a world where X is false.
***
There is, however, a subtle difference between the textbook example of a medical test and the case of a doctor telling you about your condition. Do you see that difference?
The gigantic constraint imposed by the setup of the standard Bayesian exercise is that we are handed the hypothesis in advance. A doctor, on the other hand, can communicate any information that she can conceive and you can interpret: she can mention any kind of disease, even a made-up one; she can say "it's raining outside"; she can construct complicated, nested, totally arbitrary statements like "my friend said that you think that there is a cat on the roof" - none of which fits the textbook treatment of the Bayes formula.
This is essential. In the real world, testing a hypothesis that has been handed to you on a silver platter is rarely the bottleneck to any true cognition. The world we live in is so insanely multi-dimensional that our hypothesis space is vast and terrifying, and the actual difficulty lies in handling all of these hypotheses at once: knowing which ones to discard and when, and using weak evidence to favor one stronger candidate over millions of others.
In fact, remember how I said earlier that observing someone say something to you makes you promote some hypotheses and demote others? That's how it would probably work for an ideal Bayesian agent with access to infinite compute, but I think we humans simply discard the more complicated hypotheses by default. So when we hear a complex, arbitrary statement like "my friend said that you think that there is a cat on the roof", what it does to us is something like "move this hypothesis into the small set of things we haven't discarded" - it makes us pay attention to it at all.
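To caricature this in code (purely my own illustration, not a claim about cognition): instead of maintaining a full posterior, a bounded agent keeps a small active set of hypotheses, and an utterance can do one of two things - reweight a hypothesis already in the set, or pull a previously discarded hypothesis into the set in the first place.

```python
# A toy bounded agent: it explicitly tracks at most K hypotheses;
# everything outside the active set is discarded by default.
K = 3
active = {"it_is_raining": 0.5, "cat_on_roof": 0.2}  # hypothesis -> weight

def hear(likelihood_ratios, default_weight=0.1):
    """Update on an utterance, given ratios P(utterance | h) / P(utterance)."""
    for h, ratio in likelihood_ratios.items():
        if h in active:
            active[h] *= ratio  # ordinary Bayesian reweighting
        elif ratio > 1.0:
            # The "attention" effect: a hypothesis we were ignoring
            # enters the active set at some arbitrary starting weight.
            active[h] = default_weight * ratio
    for h in sorted(active, key=active.get)[:-K]:  # keep the K strongest
        del active[h]

hear({"friend_said_you_think_cat_on_roof": 5.0})
print(active)  # the nested hypothesis is now being tracked at all
```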
Communication between human beings is partly about causing each other to notice hypotheses which we had been disregarding.
***
What are the implications of this?
The question that interests me is what constraints this imposes on the structure of human language. If we were aliens who had never observed any kind of language before (I don't know how this would come to pass, but bear with me), and who were told, as a fact, that humans had a way of exchanging probabilistic evidence about arbitrary hypotheses - what would we be able to deduce about the structure of language?
One reason this question is important is that the answer would likely describe the correspondence between language and the structure of the hypothesis space, which in turn means that by looking at the structure of human language, we could deduce something about human cognition and how our hypothesis space is structured.
But there is one more reason, which, I suspect, might be lost on those who were never in love with linguistics the way I once was (though that time is long gone).
The experience of studying linguistics is that of dealing with incredible, vast diversity held together by common underlying principles. It appears that whatever stochastic optimization process makes language emerge from humans' attempts to communicate, and to pass language down to their descendants, it inevitably reuses the same general types of solutions.
And when we look at these solutions, we ask ourselves: why must language have stratification? Why must it have a vocabulary, a morphology, a syntax? Why must its syntax be recursive? And so on.
(The urgency of these questions is highlighted by the fact that sign languages - languages typically used by deaf people - end up having the same general features, despite differing in many ways from spoken languages and despite using vastly different material for communication: manual signs instead of spoken or written ones. Whatever the source of these common properties of language, it's not just about spoken languages - these features somehow pop out whenever you try to have a human language, on any material.)
I think the way we could get an actual, technical, precise answer to these questions is by satisfactorily answering the question "what constraints are imposed on a system used for the exchange of arbitrary probabilistic evidence?"
***
I think this approach to language warrants experimentation. The experiments I want to design would, ideally, allow us to observe how language emerges, develops, and unfolds, with its characteristic structure appearing naturally, each feature arising because it solves a specific problem. Even more ideally, they would allow us to tweak the initial parameters of the setup so that features reproducibly arise, change, or disappear - a kind of ablation study that would let us show which feature of human language solves which problem.
(This is why I call this whole project "language growing" in my head.)
One obvious thing I want to try is to use reinforcement learning to simulate the kind of stochastic process in which language emerges: agents pursuing some common goal that requires communication, and, ideally, agents teaching the language to newer agents, thus passing it down.
I have not yet designed such an experiment. An important blocker is that I haven't come up with a way to give the agents a shared hypothesis space that is non-trivial in any way. Agents exchanging evidence about a fixed set of hypotheses will not do - all they would need is a fixed set of signals.
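To make the degenerate case concrete, here is a minimal sketch (all names and numbers are mine) of a Lewis-style signaling game: a fixed set of world states, a speaker who sees the state, a listener who must guess it, and simple tabular learners. With the hypothesis space fixed and finite, the agents merely converge on a fixed codebook of signals - nothing resembling vocabulary, morphology, or syntax needs to emerge.

```python
import random

N = 4  # number of world states = number of signals = number of actions

# Speaker: state -> signal scores; listener: signal -> action scores.
speaker = [[0.0] * N for _ in range(N)]
listener = [[0.0] * N for _ in range(N)]

def choose(scores, eps=0.1):
    """Epsilon-greedy choice over one row of a score table."""
    if random.random() < eps:
        return random.randrange(N)
    return max(range(N), key=lambda i: scores[i])

for _ in range(20000):
    state = random.randrange(N)               # the world picks a state
    signal = choose(speaker[state])           # the speaker emits a signal
    action = choose(listener[signal])         # the listener acts on it
    reward = 1.0 if action == state else 0.0  # shared goal: recover the state
    # Simple incremental value updates toward the observed reward.
    speaker[state][signal] += 0.1 * (reward - speaker[state][signal])
    listener[signal][action] += 0.1 * (reward - listener[signal][action])

# Typically each state ends up with a dedicated signal: a fixed codebook.
print([max(range(N), key=lambda sig: speaker[st][sig]) for st in range(N)])
```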
(One source of inspiration here is so-called Solomonoff induction, which describes an ideal Bayesian sequence predictor with access to unlimited compute. It is indeed represented as a weighted mixture of all possible hypotheses, with the weights of different hypotheses going up and down as new evidence arrives. But there, the set of all hypotheses corresponds to the set of all Turing machines predicting the sequence, which doesn't seem very similar to how language works.)
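A finite toy version of such a mixture is easy to write down (here, three hand-picked predictors of mine stand in for the Turing machines; everything else about the real construction, including the complexity-based prior, is omitted):

```python
# A tiny Bayesian mixture over a finite hypothesis class, in the spirit
# of Solomonoff induction (which mixes over all Turing machines instead).
hypotheses = {
    "mostly_zeros": lambda hist: 0.9,  # each returns P(next bit = 0)
    "mostly_ones":  lambda hist: 0.1,
    "alternating":  lambda hist: 0.9 if hist and hist[-1] == 1 else 0.1,
}
weights = {name: 1.0 / len(hypotheses) for name in hypotheses}

history = []
for bit in [0, 1, 0, 1, 0, 1]:  # observed evidence, one bit at a time
    for name, predict in hypotheses.items():
        p0 = predict(history)
        weights[name] *= p0 if bit == 0 else 1.0 - p0  # reweight by likelihood
    total = sum(weights.values())
    weights = {name: w / total for name, w in weights.items()}
    history.append(bit)

print(weights)  # "alternating" now dominates the mixture
```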
***
So this is where I'm stuck, for I don't know how to push this research agenda further just yet. If you see how to design a Bayesian language-growing experiment, please tell me what I've missed. If you see a way to attack this theoretically - deriving precise results from precise assumptions - please let me know as well.