tape_2046 / beatrice vorster                    © 2023 









01 ///

Rhythm is a state of imbalance: the sonic unit always negotiates its position in relation to the previous and following frequency; it is part of an index. We are dealing with actors or particles in a space whose relationship unfolds durationally; a social situation which is self-duplicating across different time scales. In the sonic framework, vibration is communication. Rhythmic actors become interfaces of energy which, in the context of language, create meaning through intention. The augmented experience of communication calls for a renegotiation of these networks between the organic and the sentient, tracing this triangulation of language, meaning and rhythm in a computational framework. Essentially, the acceleration of machinic natural language communication highlights the limitations of linguistics as a formulaic approach to identifying modes of communication. What becomes apparent is rather the delineation between communication and expression; advances in machinic utterances position anthropocentric attachments to language-as-sentience/intelligence as somewhat irrelevant. What offers greater potential for approaching the territories of language, castrated from meaning, are instead exercises in considering the rhythmic relationship between music and speech.


02 ///


In the framework of rhythmanalysis, our conception of rhythm is largely determined by the external rhythm’s relationship with our organic bodies - the internal processing rhythms of heartbeat, breath, dilation (Lefebvre, 1992). However, we live in an increasingly augmented environment - the audio ecology is as much dominated by rhythms which sit on the periphery of, outside of, or in contrast with our bodily experience, namely the machinic-digital. Through amplification, recording and generative technologies, rhythm becomes something which is not necessarily tied to organic bodies, but rather to the experience of data. Language as a tool of communication inhabits this rhythm-as-data space as much as music practices do in the motion of sounding:

“Human speech production process first translates a text (or concept) into movements of muscles associated with articulators and speech production-related organs. Then using air-flow from the lung, vocal source excitation signals, which contain both periodic (by vocal cord vibration) and aperiodic (by turbulent noise) components, are generated. By filtering the vocal source excitation signals by time-varying vocal tract transfer functions controlled by the articulators, their frequency characteristics are modulated. Finally, the generated speech signals are emitted. The aim of TTS [Text to Speech] is to mimic this process by computers in some way” (2016)
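The source-filter description in the quote above can be caricatured in a few lines of numpy. This is a toy sketch of the general principle - periodic plus aperiodic excitation shaped by resonant 'vocal tract' filters - not WaveNet's method; the fundamental, formant frequencies, bandwidths and noise mix are all illustrative assumptions:

```python
import numpy as np

SR = 16000  # sample rate, matching the 16,000 samples per second cited later

def resonator(x, fc, bandwidth):
    """Two-pole resonant filter centred on fc Hz: a crude 'formant'."""
    r = np.exp(-np.pi * bandwidth / SR)
    theta = 2 * np.pi * fc / SR
    a1, a2 = -2 * r * np.cos(theta), r * r
    y = np.zeros_like(x)
    for i in range(len(x)):
        # y[-1] and y[-2] read trailing zeros at the start, so no special case
        y[i] = x[i] - a1 * y[i - 1] - a2 * y[i - 2]
    return y

def source_filter_vowel(f0=110.0, formants=(700, 1100), dur=0.5, noise_mix=0.05):
    """Pulse train (vocal-cord vibration) plus noise (turbulence),
    filtered by two resonators standing in for the vocal tract."""
    n = int(SR * dur)
    source = np.zeros(n)
    source[::int(SR / f0)] = 1.0          # periodic source at the fundamental
    source += noise_mix * np.random.randn(n)  # aperiodic, turbulent source
    out = source
    for fc in formants:
        out = resonator(out, fc, bandwidth=100.0)
    return out / np.max(np.abs(out))

vowel = source_filter_vowel()
print(vowel.shape)  # (8000,) - half a second of toy 'speech' at 16 kHz
```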

Essential in the exercise of reproducing natural language is rhythmic integrity. Early text-to-speech experiments such as NETtalk’s neural network (1986) demonstrated the computer's ability to generate speech from a dataset, in this case of children’s literature. However, what distinguishes this computational action of sounding from human speech production is its rhythmic structure.

fig. 1 Three audio examples of NETtalk, a neural network created by Terrence Sejnowski and Charles Rosenberg designed to pronounce English text. The first example is an initial iteration of NETtalk with input from a five-year-old. The second example takes place twenty iterations after the first, making it much more refined. The third example is the same trained network but with new text.

03 ///

WaveNet’s use of raw audio models - in excess of 16,000 samples per second - gestures towards Curtis Roads’s labour in the microsound. Roads maps the temporal elasticity of the digital sample: from the infinite / supra-musical, encompassing natural periodicities of months, years, decades, centuries and greater; to the sample and subsample, which take account of digital and electronic rates "too brief to be properly recorded or perceived", measured in millionths of seconds (microseconds); and finally the infinitesimal or infinitely brief, again in the extra-musical domain (2004).
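To place WaveNet's rate on Roads's scale, a back-of-envelope calculation; the 50 ms grain length is an illustrative assumption, chosen only as a typical microsound duration:

```python
# Durations at the sample and micro time scales, at the 16 kHz rate cited above
SR = 16_000                       # samples per second
sample_dur_us = 1_000_000 / SR    # one sample, in microseconds
print(sample_dur_us)              # 62.5 - each sample sits in Roads's micro domain

grain_ms = 50                     # an assumed, typical microsound grain length
print(int(SR * grain_ms / 1000))  # 800 samples folded into one 50 ms grain
```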

fig. 2 Wavenet’s raw audio model.

In contrast to concatenative TTS models, where a very large database of short speech fragments is recorded from a single speaker and then recombined to form complete utterances, WaveNet relies on a choral / cloud assemblage: “training on many speakers made it better at modelling a single speaker than training on that speaker alone, suggesting a form of transfer learning” (2016). This model bears a relationship to Deleuze and Guattari’s writings on indirect discourse, which cannot be explained by distinctions between subjects but rather through the assemblage: “the collective assemblage is always like the murmur”.

Just as the musical employment of rhythm necessarily requires the space between units, language is only understood when units are experienced both as units and as a relational assemblage. The latent space between the unspoken continuum of indirect discourse and the enunciation is a similarly productive site; that is to say, the absence of a unit carries with it *something* which is key to emission.

Rhythmic sequences, whether musical or spoken, appear as social bodies; a corpus.

04 ///

The use of sampling technology - of manipulation and duplication of recorded matter - is equally present in NLP models and electronic music practices.

Each sonic grain carries with it the energy of the original. Significantly, the raw audio model employed by WaveNet uses human voice waveforms as training input sequences. Energy is transferred beyond membranes of organic and inorganic, encoded with the emotional life of the recorded speaker. In the case of music practices, the record is less and less a recording of a performance and increasingly something which is birthed in software. MP3s as data undergo a becoming-haptic when amplified in space: data becomes vibration which then touches our bodies. Julian Henriques’ work in Sonic Bodies proposes that bodies in sync, as a response to rhythm, can be conceived of as an energy exchange which includes the affective temperature rise of the individuated body and the collective body: a becoming-vibrational entity (2011). In this potential of sonic grains, the rhythmic system communicates simultaneously as units in a collective body and, on the micro level, as a kind of electron, particle or cellular system.

This process of transduction can be illustrated most straightforwardly by tracing recorded material performed by an organic body - most notably, The Winstons’ ‘Amen break’, which has historically undergone actions of processing, atomisation and duplication. The break from The Winstons’ 1969 ‘Amen, Brother’, originally lasting seven seconds, carries with it the gesture of the drummer [the controlled muscular spasms, performed by the body in a moment of energy exertion], the resonance of the recording space in Washington, D.C., the texture of the recording material, the haptics of the vinyl [the damage of the needle], digitised and shared online again and again. The looping of these seven seconds, often at double speed, resurrects this moment of freestyle and relocates it spatially, demanding a bodily reaction.

fig. 3  oldschool breakbeat dance - everybody is in the place.mp4

Intuitive gesture → muscular spasm → reaction → energy exertion [strike] → skin of the drum → vibration of air surrounding → incoming soundwave * converted by microphone into electrical signal → production of time-varying magnetic field in the gap of the magnet → movement of tape * magnetisation → record of the electric signal 
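The double-speed looping of the break described above can be sketched numerically. The array below is stand-in noise, not the actual recording, and the CD sample rate and four-repeat loop are assumptions; the point is the arithmetic of early samplers, where reading every second sample halves the duration and raises the pitch an octave:

```python
import numpy as np

SR = 44_100                        # assumed CD-rate grid for the digitised break
DUR = 7.0                          # the roughly seven-second 'Amen' break
break_samples = np.random.randn(int(SR * DUR))  # stand-in for the sampled audio

# Naive double-speed playback: read every second sample.
double_speed = break_samples[::2]
print(len(break_samples) / SR)     # 7.0 seconds at original speed
print(len(double_speed) / SR)      # 3.5 seconds - and an octave up in pitch

# Looping: tile the sped-up break end to end, again and again
loop = np.tile(double_speed, 4)
```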

This reproduction of a recorded moment of aliveness resonates with Manuel DeLanda’s conception of nonorganic life - the articulation that ‘life is not the sole property of organic creatures’. He gives the example of the tree which grows from a seed: its DNA molecule organises matter into a tree, but the process requires CO2 - it is not that something inorganic becomes organic; rather, all elements are part of a living whole (2015). Similarly, in WaveNet’s TTS development, the use of multiple human voices as raw audio data meant that the model encoded characteristics in the audio beyond the voice itself - “it also mimicked the acoustics and recording quality, as well as the breathing and mouth movements of the speakers” (2016).

fig. 4 organic energy exertion of bass drummer in Verdi: Requiem (2-1) "Dies irae" (Chorus) / 威爾第:安魂曲.mp4

05 ///

The comparably realistic mimicry of spoken language in WaveNet’s model lies in imitating elements of human communication which are not tied to meaning, but rather located in moments on the periphery of understanding. The philosophy of communication reiterates the lack of certainty between origin and destination: there is no assurance that the message will be received. Michel Serres designates this disruptor of meaning as the parasite (in French, parasite can also mean static), a figure concerned with data loss through mediated communication (2013). This is as much a part of communication as the intentional production of meaning.

The process of uninterrupted understanding can be illustrated as follows:

idea → word [string] → association with a particular series of sounds → sounding / enunciation → vibration of air → vibration of eardrum of the listener → association with a particular series of sounds → word [string] → idea

What is apparent is that language is a string of sounds which act as a substitute for an idea. This becomes increasingly complex when imagined in the context of natural language processing, which makes visible the organic process of understanding. Importantly, what is highlighted is that the labour of attaching meaning to sounds is as much a human as a computational process. In other words, babble with intention = language, regardless of the interlocutor. This can be demonstrated with WaveNet’s exercises in training the network without a textual context.

fig 5 “If we train the network without the text sequence, it still generates speech, but now it has to make up what to say. As you can hear from the samples, this results in a kind of babbling, where real words are interspersed with made-up word-like sounds” (2016). 

Comparatively, early speech development in humans also feels communicative in its rhythmic sounding:

fig 6, 7 early speech development in humans. 

These mime the functionalities of meaning production, necessarily exhibiting themselves as language while bordering signification, presupposing indirect discourse. This effect is made apparent in the rhythmic consistency of ‘Simlish’, a nonsensical, virtual mode of expression created for the 2000 game ‘The Sims’. Deleuze and Guattari cite Bakhtin’s notion that there must be an “extra something” which “remains outside the scope of the entire set of linguistic categories and definitions”. This notion of an “extra something” which directs meaning outside of sound=word=meaning is most clearly demonstrated in the untranslatable expressions of our avatars. Again, significantly, this action is performed by human voices: bodies further performing a physical language [not visible to the player] which corresponds to the sounding aspects of the fictional language.

fig 8 simlish recording. Krizia Bajos and Scott Whyte at a recording studio inside Electronic Arts in Redwood, California.mp4

06 ///

In the realm of
expression =  communication + extra something           
the voice depends on the signified. Reliant on the vibration of the vocal cords in the process of sounding, there is an existing discourse around the relationship between the voice and identity: that the uniqueness of the voice does determine a subjectivity - “it communicates the uniqueness of the one who emits it, and can be recognized by those to whom one speaks” (Cavarero, 2005). This relationship is expanded on by LaBelle through readings of the performativity of the voice; that enactments of speech create friction with semantics: “the relation of sense and nonsense, of the semantic and the sounded, is to be appreciated as the very fabric of voice, and it is the mouth’s ability to flex and turn, resonate and stumble, appropriate and sample, which continually reminds us of the potentiality promulgated in being an oral body” (2014).

Mimicry, as a tool of transferring a reality, becomes increasingly complex when applied to computational utterances. The phonation of the voice - whether computational or human - is not peripheral to expression but quite central, and dependent on the sample.

This periphery is perhaps best expressed in the distinct vocal excesses of Flirta D and Catherine Jauniaux.

fig 9, 10 vocal mimicry in musical performance

07 ///

This framework does not determine that music is a language, but rather that attention to the rhythmic properties of expression presents a site for composition. The study of isochrony and prosody could be applied to reflect elements of language not encoded by grammar or choice of vocabulary - a way of thinking through the rhythmic importance of communication.

These rhythmically important tools of transmission, which lie outside of words, form the basis of my research artefact. The process consisted of collecting samples to render an audio dataset which can be triggered by an external sample pad and sequencer. Of specific interest was the collection of audio data from WaveNet’s article which imitated the parts of human speech not attached to language - for example the sound of the encoded breath, or of hesitation. These samples were then loaded into two Max granular patches built for the project: a Gaussian granular patch with a softer sound, and a ‘sugar’ granular synthesis patch which allowed for harsher interpolation. The textures were then combined with human voice recordings based around sounding rather than meaning, in addition to Swedish and Simlish audio samples. In generating this sound organism, I was attempting, on the one hand, to fracture a sense of rhythmic regularity through the use of a polyrhythmic structure, while using the voice as an instrument.
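The Gaussian-windowed granulation described above can be given a minimal numpy sketch - a stand-in for, not a reproduction of, the Max patches; the grain length, grain density, output duration and the noise stand-in for a voice sample are all illustrative assumptions:

```python
import numpy as np

SR = 44_100  # assumed working sample rate

def gaussian_grain(source, start, dur_ms=60.0, sigma=0.25):
    """Cut one grain from `source` and shape it with a Gaussian window,
    echoing the softer envelope of a Gaussian granular patch."""
    n = int(SR * dur_ms / 1000)
    grain = source[start:start + n].copy()
    t = np.linspace(-1, 1, len(grain))
    return grain * np.exp(-0.5 * (t / sigma) ** 2)

def granulate(source, n_grains=200, dur_ms=60.0, out_dur=2.0, seed=0):
    """Scatter Gaussian grains at random source and output positions:
    a cloud-like texture rather than a metrically regular pulse."""
    rng = np.random.default_rng(seed)
    out = np.zeros(int(SR * out_dur))
    glen = int(SR * dur_ms / 1000)
    for _ in range(n_grains):
        src_pos = rng.integers(0, len(source) - glen)
        out_pos = rng.integers(0, len(out) - glen)
        out[out_pos:out_pos + glen] += gaussian_grain(source, src_pos, dur_ms)
    return out / np.max(np.abs(out))

voice = np.random.randn(SR)  # stand-in for a one-second voice/breath sample
cloud = granulate(voice)
print(cloud.shape)           # (88200,) - two seconds of granular texture
```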




Cavarero, A., 2005. For more than one voice. Stanford, Calif.: Stanford University Press.

DeLanda, M., 2019. Philosophy and Simulation: The Emergence of Synthetic Reason. London: Bloomsbury Revelations.

Deleuze, G. and Guattari, F., 2013. A thousand plateaus. London: Bloomsbury.

Dolar, M., 2006. A voice and nothing more. Cambridge, Mass.: MIT Press.

Eshun, K., 1999. More brilliant than the sun. London: Quartet Books.

Henriques, J., 2011. Sonic Bodies. Continuum International Publishing.

LaBelle, B., 2014. Lexicon of the mouth. 1st ed. London: Bloomsbury Academic.

Lefebvre, H., 2013. Rhythmanalysis. London: Bloomsbury Revelations.

Reynolds, S., 2013. Energy flash. London: Faber and Faber.

Roads, C., 2004. Microsound. Cambridge, Mass.: MIT Press.

Serres, M., 2013. The parasite. Minneapolis, MN: University of Minnesota Press.

van den Oord, A. and Dieleman, S., 2016. WaveNet: A generative model for raw audio. [online] Available at: <> [Accessed 23 April 2022].

video:
2016. 1990s Drum and Bass Club, People Dancing With the Lights On. [online] Available at: <> [Accessed 1 May 2022].
2019. How the Language From the Sims Was Created. [online] Available at: <> [Accessed 1 May 2022].
2009. oldschool breakbeat dance - everybody is in the place. [online] Available at: <> [Accessed 1 May 2022].
2012. NETtalk Test. [online] Available at: <> [Accessed 1 May 2022].
2010. Toddler Jibberjabber Conversation with Grandfather. [online] Available at: <> [Accessed 1 May 2022].
2022. Verdi: Requiem (2-1) "Dies irae" (Chorus) / 威爾第:安魂曲. [online] Available at: <> [Accessed 1 May 2022].
2019. Viral video of baby talking to his dad will melt your heart. [online] Available at: <> [Accessed 1 May 2022].

post script