Academic journal
Frontiers in Psychology

frontiers in PSYCHOLOGY

Cognitive Science

Voice, (Inter-)Subjectivity, and Real Time Recurrent Interaction

Fred Cummins

Journal Name: Frontiers in Psychology
ISSN: 1664-1078
Article type: Original Research Article
Received on: 30 Apr 2014
Accepted on: 27 Jun 2014
Provisional PDF published on: 27 Jun 2014
Citation: Cummins F (2014) Voice, (Inter-)Subjectivity, and Real Time Recurrent Interaction. Front. Psychol. 5:760. doi:10.3389/fpsyg.2014.00760
Copyright statement: © 2014 Cummins. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

This Provisional PDF corresponds to the article as it appeared upon acceptance, after rigorous peer-review. Fully formatted PDF and full text (HTML) versions will be made available soon.


Fred Cummins, UCD School of Computer Science and Informatics, University College Dublin

June 24, 2014

Abstract

Received approaches to a unified phenomenon called "language" are firmly committed to a Cartesian view of distinct unobservable minds. Questioning this commitment leads us to recognize that the boundaries conventionally separating the linguistic from the non-linguistic can appear arbitrary, omitting much that is regularly present during vocal communication. The thesis is put forward that uttering, or voicing, is a much older phenomenon than the formal structures studied by the linguist, and that the voice has found elaborations and codifications in other domains too, such as in systems of ritual and rite. Voice, it is suggested, necessarily gives rise to a temporally bound subjectivity, whether it is in inner speech (Descartes' "cogito"), in conversation, or in the synchronized utterances of collective speech found in prayer, protest and sports arenas worldwide. The notion of a fleeting subjective pole tied to dynamically entwined participants who exert reciprocal influence upon each other in real time provides an insightful way to understand notions of common ground, or socially shared cognition. It suggests that the remarkable capacity to construct a shared world that is so characteristic of Homo sapiens may be grounded in this ability to become dynamically entangled as seen, e.g., in the centrality of joint attention in human interaction. Empirical evidence of dynamic entanglement in joint speaking is found in behavioral and neuroimaging studies. A convergent theoretical vocabulary is now available in the concept of participatory sense-making, leading to the development of a rich scientific agenda liberated from a stifling metaphysics that obscures, rather than illuminates, the means by which we come to inhabit a shared world.

1 Introduction

We speak with confidence of something called "language", as if this term referred to a single system, capable of multiple forms of manifestation (writing, speech, signing), but unified by organized structures and processes in the formal domains of phonology, morphology, syntax, and semantics. This emphasis on systematicity and symbolic encoding has utterly dominated the scientific view of "language" at least since the structuralist innovations of Saussure (1959/1916), and has been greatly reinforced by the pivotal role of generative linguistics in the birth of the cognitivist account of mind as a form of symbol-based information processing (Fodor, 1975). In the context of inter-personal communication, language, on this view, serves as a form of message passing, whereby ideas conceived in the mind of one person are encoded, first into words, and then into movements of mouth or hand, at which point they become transmittable to another, who sets about decoding them, thereby gaining access to the ideas of the sender. The message passing perspective on language is compelling, powerful, and supported by a host of technologies, from the very first forms of writing to the most sophisticated of digital platforms.

The emphasis on symbols and systematicity allows the identification of a tentative boundary between the linguistic and the non-linguistic. For example, a conventional distinction is drawn between phonological and non-phonological characteristics of the sounds of speech. Roughly, those features that support the identification of discrete categories, such as phonemes, are taken as indices of linguistic structure, while non-categorical and continuously varying features such as the loudness of a voice would lie beyond the notional bounds of language proper. Once discrete entities belonging to non-overlapping categories are available, they can be combined into larger symbolic structures, from syllables to novels.

Language thus appears to be a clearly delineated and unified phenomenon, of which one can meaningfully construct theories. This leads to a compelling observation that there seems to be a yawning chasm between the many kinds of communication systems found in animals and the generative, creative richness found in every human language. And so the foundations are laid for the perplexing observation that language seems to have appeared not so very long ago in an evolutionary timescale, and to have immediately enabled the development of the whole of human culture, technology, and all the institutions of all societies.

Two related observations will serve to provide us here with a slightly different view of "language". The first is that the above story is fundamentally committed to an ontological split between mind and world. If we accept such a split, then meanings or ideas belong firmly in the realm of the mental, and they find expression indifferently in writing or speech, each of which provides a kind of physical container for the passing of ideas from one mind to the next. The second observation is that the traditional story enforces a somewhat arbitrary divide between the linguistic and non-linguistic, motivated by the desire to ensure that language is systematic and supports the kind of symbolic operations familiar from syntax and related disciplines. If we observe communication among people, we see many aspects to that behavior that never feature in linguistic theory, and that nevertheless seem to be reliably and essentially associated with inter-personal communication. These two observations are related, because if we consider alternatives to the Cartesian mind/world split that divides ideas and meanings from sounds and movements, the apparent significance of many of the behaviors and features reliably and regularly attending communication may change, and with that, the boundaries of "language" may shift, or, indeed, fragment, to reveal a variety of phenomena that do not admit of a single systematic description.

I will argue that the way in which we conventionally treat of the phenomenon called "language" is overly restrictive, and seems more appropriate to the characterization of writing than speaking/listening (Linell, 2005). Older than writing by far is the voice, and the voice has remarkable properties all of its own. Chief among these is the obligatory association between the voice and a transient subject-pole that grounds intentionality. This, it seems to me, may be part of the reason the inner voice seems to be inextricably associated with the Cartesian subject. To develop this notion, I will turn to the substantive domain of joint, or collective, speaking, showing how collective speech engenders a different kind of subject, displaying collective intentionality. Furthermore, just as the voice of the individual admitted of development and codification in writing, so collective speaking admitted of development and codification in practices of liturgy and ritual. Written language, which is the more accurate target of modern linguistics, is thus not the only descendant of voice. The empirical study of collective speaking is in its infancy, but it reveals emergent phenomena that arise only in the real time reciprocal interaction of speakers speaking in unison. These emergent phenomena add substance to the argument that the traditional depiction of language as message passing mischaracterises, or omits, much of what is going on in vocal communication (Cowley and Love, 2006). It neglects the fluid intertwining of subjectivities that arises in real time reciprocal interaction, and that appears clearly in joint speaking. This only becomes apparent if we approach languaging (rather than language) as a set of multi-faceted behaviors that defy characterization from a single metaphysical viewpoint.¹

2 Revisiting Descartes

Let us fancifully drop in on Descartes as he deduces his own existence. The statement "Cogito, ergo sum" is without doubt the most famous line in Western Philosophy, and the basic outline of the argument underlying it is overly familiar.² A sceptical philosopher, wishing to establish a foundation for true and certain knowledge, recognizes that the world of appearances, mediated by the senses, may be illusory. He considers what remains after denying the testimony of the senses, and reasons thus:

So after considering everything very thoroughly, I must finally conclude that this proposition, I am, I exist, is necessarily true whenever it is put forward by me or conceived in my mind. (Meditation 2, AT 7:25)

The "I" that is invoked here is explicitly and emphatically not a body, but a mind (7:27). The split between mind and world is absolute. Irrespective of how the consequences are played out, Descartes' certainty has become the split we have failed to distance ourselves from. Substance dualism narrowly conceived is, of course, not a respectable metaphysical position any more, but the split that is effected here between mind and world, and at the same time, between metaphysics and epistemology, far from being overcome, has become the foundational assumption upon which the whole of psychology (and more) has been built. As Sheets-Johnstone put it, it has become "a lexical band-aid covering a 350-year-old wound generated and kept suppurating by a schizoid metaphysics" (Sheets-Johnstone, 1999, p. 275).

But what is going on for Descartes? There is a voice. Whether it is a voice speaking in Latin "Cogito, ergo sum!", or a voice speaking in French "Je suis, j'existe!", it is a (silent) utterance, a thought in the form of words. Without language (better: languaging), there is no such thought. Without a culturally specific history of vocal interaction among people during which meanings and uses of language emerge, there is no such voice. The solipsistic prison of Descartes' fancy is not so devoid of other people as he seems to believe, for in harbouring the voice that can utter the "Cogito!", it is populated by the practice of Latin, or the practice of French. Closing the eyes does not keep out the world, and it does not keep out other people.

The inner voice of linguistic thought that speaks here "to" Descartes is not different in kind from the outer voice of overt speech. Indeed, the whole metaphorical quagmire associated with the use of the terms inner and outer stems from the very confusion I wish here to circumvent. Vygotsky has presented a thorough argument that the overt but self-directed speech of young children is, firstly, a specialization of intersubjective social speech, and secondly, is the precursor to inner speech, or linguistic thought (Vygotsky, 1986). This insight provides us with an understanding of continuity between overt speech and silent speech, or linguistic thought.

¹ A complementary account of languaging from an enactive perspective is provided in Bottineau (2010). This account adheres to a more conventional view of what the domain of language is than adopted here, but many of the fundamental concerns raised therein resonate with the themes of this article.

² The famous Latin phrase does not appear in the Second Meditation, where the original argument is most clearly made.

What if we choose to interpret Descartes' predicament somewhat differently? Instead of considering the voice as evidence of a pre-existing subject, we might consider it to give rise to a transient subjecthood. We cannot understand the occurrent thought as an utterance in the message-passing sense, as there are not two distinct domains, a speaker and a listener, for any message to be passed among. But we are now entertaining the tentative notion that there is no Cartesian subject before the occurrence of the thought, and so any subjecthood associated with this utterance arises with the utterance and fades thereafter. This is not a fully fledged psychological subject, equipped with the mechanisms of "cognitive systems"; it is a subject-pole that allows a distinction between subject and world, or self and other, to be discerned, and that supports or invites the ascription of intentionality. It is a transient orientation, tied to the real time unfolding of the linguistic thought itself ("... whenever it is put forward by me or conceived in my mind."). Later in the 2nd Meditation, Descartes himself seems to concur with this association of the Subject with the transient inner voice when he says "I am, I exist – that is certain. But for how long? For as long as I am thinking. For it could be that were I totally to cease from thinking, I should totally cease to exist." (Meditation 2, AT 7:27). Now the nature of "thinking" has not been generally agreed upon, but the form of thinking Descartes here alludes to is clearly the utterance of an inner voice, in specific words, words which he is capable of repeating to us, words which we can characterize as Latin or French. I wish to pursue this idea, that voice gives rise to the complementarity between the poles of subject and world, and it does so in real time.

3 Voices and Subjects

[V]oice is a kind of sound of an ensouled thing. For none of the things without soul gives voice, though some are said by analogy to give voice, such as the flute and the lyre and whatever other of the things without soul have the production of sustained, varied and articulate sound. For voice also has these features and so there is a likeness. (Aristotle, 1986, 420b, p. 178)

The association between the animate (even ensouled) subject and the voice is ancient. In Connor (2000), the long history of the subjects perceived as being behind voices emanating from unlikely places is recounted in detail. From the Delphic oracle through the medieval fascination with demonic possession, prophecy and divine inspiration, voices perceived as coming from the stomach, the genitals, or even a crack in the rock have been enthusiastically attributed to invisible subjects, rather than to sound-producing properties of either inanimate objects or of atypical parts of the body itself. Much of the ghoulish fascination that the ventriloquist's dummy attracts lies in the obligatory projection of a subject behind the grotesque appearance. Connor writes:

For I produce my voice in a way that I do not produce these other attributes [eyes, hair, gait, fingerprints, etc]. voice is the process which simultaneously produces articulate sound, and produces myself, as a self-producing being. (Connor, 2000, p. 3)

It is telling that the words uttered in one of the very earliest sound recordings, made by Alexander Graham Bell in 1881, are "T-r-r – T-r-r – There are more things in heaven and earth Horatio, than are dreamed of in our philosophy – T-r-r – I am a Graphophone and my mother was a Phonograph" (Volta Laboratory, 2013, emphasis added), thus instinctively investing one of the very first disembodied voices born of technology with subjecthood of its own. Remarkably, the telephone and the phonograph came into being almost simultaneously, in 1876 and 1877 respectively. Add to these the advent of radio transmission of the human voice, first done in 1900 in Brazil, and it is clear that we have been awash in disembodied voices for over a hundred years and counting. The irritating proliferation of pseudo-personalities such as the iPhone's Siri seems likely to continue.

If the voice Descartes conjures up alone generates a subjectivity that is aligned with the classic subject-object distinction at the level of the single individual, then we might give consideration to the possibility that voice employed in different circumstances might generate other forms of subjectivity, without commitment to individual Cartesian minds.

3.1 Shared Subjectivity and Common Ground

When an utterance is made in a specific context with speaker and listener both present, it is interpreted in the light of the shared understanding of all parties. This has found expression in theoretical notions of common ground (Clark and Brennan, 1991), or socially shared cognition (Schegloff, 1991). Most developments of the idea of common ground are couched within the information processing/message passing framework, and therefore make use of some version of aligned or shared representational content. However, it is not necessary to appeal to such unobservable constructs from a hidden Cartesian world (Hutto and Myin, 2013). There is ample evidence that participants in a conversational exchange become mutually linked in many subtle but observable ways. Eye movements (Richardson et al., 2007), postural sway (Shockley et al., 2009) and even blinking (Cummins, 2012) have all been found to become subtly intertwined in conversation, leading to a dynamic entanglement of the participants. Speakers and listeners are further linked through the provision by the latter of signals of ongoing engagement through postural, gestural and vocal indices or backchannels (Wagner et al., 2014).

The yoking together of two or more people engaging in language behavior establishes a common basis from which the participants confront the world. It makes available a shared framework within which statements can be interpreted. It thus provides a scaffold for shared intentionality (Carr, 1987). The ability to share an intentional perspective seems to be at the very heart of human language use, but it is not an all-or-nothing affair. Two protesters with common purpose who chant the same slogan demonstrate an extreme alignment with respect to the world. But two people engaged in heated disagreement must still achieve a great deal of alignment in order to disagree felicitously. The topic of disagreement must be foregrounded, at the expense of everything else. In disputing causal chains, in laying out competing sequences of events, and in presenting different interpretations of the significance of actions and events, two disputants are necessarily sharing a great deal of background framing, picking out these events rather than those, identifying the same actors, while quarrelling over their respective roles. Even in the absence of conversational exchange, people observing the same scene exert reciprocal influence on one another, such that their gaze behavior, and by inference, the details they pay attention to, become inter-dependent. In a series of experiments summarized in Dale et al. (2013), the gaze behavior of subjects is demonstrated to depend sensitively on the presence of others, and on whether one subject knows or believes that the others are seeing and hearing the same things as they are.

If joint languaging provides a very powerful example of intentional alignment, then it might be that the ability to coordinate the manner in which we jointly pay attention to the world is an important skill that facilitated the emergence of such behavior, as argued in Fusaroli and Tylén (2012). Sometime after the last speciation event some 5 or 6 million years ago that gave rise to chimpanzees and bonobos on the one hand, and the hominid line on the other, something happened that had profound consequences for our ability to share perspectives and to coordinate with one another. There is one small biological change that we know occurred in that time, that might play a significant role here. That change gave rise to the white sclera of the human eye that contrasts vividly with the darker iris, thus providing a very clean signal of the direction of gaze of a partner (Tomasello et al., 2007). The other great apes do not have such a contrast, and their ability to align their gaze is severely limited, and based on head direction rather than the eyes, although chimpanzees and bonobos in particular do display some evidence of understanding the visual perspective of another (Okamoto-Barth et al., 2007). The ability to follow each other's gaze thus facilitates the sharing of attention, and has been demonstrated to structure mother-child interactions, while inducing the ability to take part in languaging (Tomasello and Farrar, 1986).

As common ground is established, the subjective point from which utterances are spoken also shifts. Vygotsky has pointed out how the (linguistic) subject becomes an implied, rather than an overt, element in speech once common understanding has been established (Vygotsky, 1986, p. 236). For example, it would be odd to respond to the question "Would you like a cup of tea?" with the answer "No, I don't want a cup of tea", instead of simply "No". Similarly, a group of people waiting for a bus establishes sufficient shared context that no one is likely to point out the obvious and say "The bus for which we are waiting is coming", but simply "coming" or some such expression. The dropping of the linguistic subject is more extensive yet in inner speech, of which Vygotsky says "it is as much a law of inner speech to omit subjects as it is a law of written speech to contain both subjects and predicates" (Vygotsky, 1986, p. 243). Many languages allow dropping of any explicit mention of the subject once it can be inferred on pragmatic grounds. This is not merely a syntactic quirk of one group of languages, as it is found in such typologically distant languages as Japanese, Chinese, Turkish, and Spanish (Huang, 1984).

It would be a mistake to simply equate the subject pole of a subject-world complementary pair with the syntactic subject, but it would be inexcusable too to ignore the deep link between the fundamental linguistic structure of subject and predicate on the one hand and the subjective pole from which utterances are brought forth on the other. The subject pole that arises in the unfolding of the voice grounds intentionality, and provides an anchoring point for reference. This is, perhaps, most explicit in the manner in which deixis functions, allowing use of terms such as "there", "here", "then", "now", whose meaning is anchored in the joint situation created by conversational participants. It is also explicit in the manner in which the first-person pronouns, both singular and plural, find flexible and context-specific use. It is implicit too in establishing a shared register and perspective within which meaning is negotiated. The differentiation of subject and world, and the ability to establish a shared perspective within which utterances function, precedes any overt syntactic knowledge or awareness by millennia (Olson, 1996).

3.2 Alignment versus Synergy

The dynamic intertwining of conversational participants interacting in real time has not gone unnoticed. An influential approach to account for the many overt and subtle ways in which two interlocutors become linked is found in the Interactive Alignment model of Pickering and Garrod (2004, 2014). This model seeks to describe the tendency for conversational partners to imitate one another at a variety of levels, from syntactic biasing, through lexical selection, down to the level of phonetic and gestural imitation. The idea that similarity in one domain can unconsciously bleed through representational levels to generate similarity in other domains provides some explanatory purchase on a great deal of corpus-based data. As a general account of the dynamic coupling and mutual accommodation found among speaker/listeners, however, it is somewhat limited. It leaves language resolutely within the heads of individual conversing partners, and this does not move beyond the Cartesian, representationalist framework. It is "representation-hungry", demanding computational representations at many levels, and indeed, in its most recent form, it conjures up a baroque series of simulations inside the heads of individuals who must not only act, but also predict the actions of others (Pickering and Garrod, 2014). This approach does not generalize in any obvious way to multi-party conversations. Nor does it account for coupling among interactants that are not strictly imitative in nature, as with the mutual influence exerted on blinks (Cummins, 2012). The tendency to alignment suggests that felicitous conversation would result in mere mimicry, which is again not what we observe, and it privileges similarity, at the expense of complementarity, thereby missing the fundamental role-based nature of conversation in which the positions of speaker and listener alternate.

A competing account has recently been proposed that regards inter-personal coordination in dialogue as a form of synergy or dynamical coupling (Fusaroli et al., 2014). This approach is rooted in dynamical approaches to coordination that are level-agnostic, seeking to understand emergent phenomena at one level (e.g., the dyad) as arising through processes of self-organization from the constrained interaction of autonomous components at a lower level (the speaker/listeners) (Kelso, 1995; Latash, 2008). This approach highlights the sensitivity of participants to real time recurrent interaction, as is evident even in the early interactions of infants and mothers (Murray and Trevarthen, 1986). It emphasizes the intertwining of the movements of participants, leading to dimensional reduction, so that two interacting persons become, temporarily, a simpler collective entity than the two persons considered as a mere conjunction of individuals. It acknowledges both synchronized and complementary actions as they contribute to this simplification, and it emphasizes the manner in which shared understanding of task constraints leads to stability of patterning in time. Although still somewhat speculative, this level-independent approach seems commensurate with the approach to be developed here that treats groups of people as synergetically organized domains in their own right, with respect to which subjectivities of a collective nature can be identified.

Synergistic approaches to human communication have been argued for by others. Thibault (2011) adopts a position not unlike the present one in which a fundamental distinction is drawn between what he calls talk and text. The role of voice described both here and in his work emphasizes the bodily entrainment that arises at a very fine scale among interactants, while the properties that linguists conventionally consider, and that admit of a computational description, constitute a distinct, and second-order set of phenomena. Although not focussed on languaging, Riley et al. (2011) argue that interpersonal movement coordination is the result of establishing interpersonal synergies of the sort described here, and they distinguish component-dominant dynamics, as portrayed within a cognitivist framework, from interaction-dominant dynamics in which the autonomy of the level of interaction is more thoroughly acknowledged. Finally, the perceptual crossing paradigm introduced by Auvray et al. (2009) provides a minimalist experimental set-up in which two people interact in real time in a minimal virtual space. While not communicative in any conventional sense, the nature of the emergent behavior observed serves to illustrate the principal point being made that the interaction itself constitutes a level of relative autonomy that is not reducible to the conjunction of properties of its components (Froese et al., 2014). These latter two examples illustrate that social interaction and languaging are not separate phenomena. Languaging is a constitutive part of the manner in which interpersonal entrainment or coupling arises in the moment by moment real time reciprocal interaction among people.

3.3 Voice versus Writing

Before giving further consideration to the relationship between subjecthood and voice, it is appropriate to recall the vast chasm that separates speech from writing, not least as the claim is made here that most of the phenomena described by modern linguistics relate, in fact, to the structure of written communication, and are only indirectly relevant to the act of speaking, which is the central form in which languaging is manifested (Linell, 2005). Since the advent of alphabetic writing in Greek society, a naive view has been available that writing is simply a device for transcribing speech. Olson (1996, p. 66) identifies overt statements that express this view from Aristotle, Saussure, Bloomfield and more. This is why theories of syntax, morphology and semantics, that together delimit much of that which we call "language", allow themselves to study and model the formal characteristics of symbol strings, without consideration of the medium of expression. This insensitivity to the enormous differences between writing and speech underlies the focus by Saussure on langue rather than parole, and by Chomsky on competence, rather than performance. With that, modern linguistic theory has turned its attention away from the most common form of languaging, indeed the only one that existed from the fuzzy origins of speech until the relatively recent development of writing and the even more novel phenomenon of mass textual proliferation. It has ignored the real time reciprocal interaction among people giving voice from context-specific situations of concern.

314 We have now a wealth of research that documents very substantial changes that arise with

315 the advent of writing, and especially with the spread of literacy consequent to the development of

316 printing. These changes affect not only the way language is used, but the very structure of the

317 consciousness of language users (Stewart, 2010). Ong (1982) provides an authoritative and com-

318 prehensive catalogue of differences between the way knowledge is managed, shared, and verbalized

319 in primary oral cultures, and in highly literate ones. Olson (1996) further documents the profound

320 conceptual and cognitive implications of the spread of literacy. Much of this work focusses on the

321 novelties that accompany writing and literacy. McLuhan claimed that "writing was an embalming

322 process that froze language" (McLuhan, 1964), and he provides an anecdote from Prince Modupe,

323 who speaks of his encounter with the written word in his West African days:

324 The one crowded space in Father Perry's house was his bookshelves. I gradually came

325 to understand that the marks on the pages were trapped words. Anyone could learn to

326 decipher the symbols and turn the trapped words loose again into speech. The ink of

327 the print trapped the thoughts; they could no more get away than a doomboo could get

328 out of a pit. . . (McLuhan 1964, p. 84)

329 With writing, texts achieve an independence from their sources. A spoken utterance is neces-

330 sarily vouched for by the speaker, while a written sentence asserts, without the contingency and

331 commitment of a speaker. I have mentioned that voice gives rise to a subjective pole. Here we

332 can see that the complement is also true: Writing gives rise to a particular kind of objectivity, one

333 in which for the first time it is possible to have "facts that speak for themselves" (Latour, 2013).

334 (For an insightful account of several ways in which objectivities are constructed, see Daston and

335 Galison, 2007.) Written sentences remain immutable and thus support dissection and analysis in

336 a way that spoken utterances, which must be articulated each time they come into being, do not.

337 The further development of speech and language technologies in the service of message passing has

338 given rise to forms of spoken language, e.g. in news broadcasts or public service announcements,

339 that bear greater similarity to written texts than to spoken utterances, while recent increases in the

340 possibility of text-based reciprocal exchanges, e.g. in SMS messaging, further serve to complicate

341 the relation between voices, texts, messages, and intentions3.

342 It is interesting in this regard to consider the constraint observed by Everett to hold in the

343 language of the Pirahã, an Amazonian tribe whose language is remarkable in its simplicity and

344 omissions, having no counting system, very restricted tenses, arguably no syntactic recursion, etc.

345 The Pirahã also have no mythology or stock of fiction. Everett attributes many of these constraints

346 to what he calls the Immediacy of Experience Principle, according to which statements by the

347 Pirahã "contain only assertions related directly to the moment of speech, either experienced by the

348 speaker or witnessed by someone alive during the lifetime of the speaker" (Everett, 2009a, p. 132).

349 Here, the strong tie between the speaker and the words spoken appears to have become sedimented

350 into the very structure of the language and culture, leaving no room for the disembodied words

351 found in writing. It is perhaps no coincidence that Everett's observations have become controversial

352 precisely among those linguists who hold syntax, and syntactic recursion in particular, to be central

353 to the very nature of language (Hauser et al., 2002; Everett, 2009b).

354 4 Speaking in Unison

355 The act of speaking in unison is a common form of vocal behavior that is accorded no particular

356 theoretical significance in a message passing view of language. On the received view, minds and

357 subjects are closed and singular; thus many people saying the same thing at the same time appears

358 merely as a multiplication of the individual speaker. The behavior does seem somewhat perplexing

359 though, for what message is being passed if we all know the words? It is worthwhile to consider

360 both the occasions in which people often speak in unison, and the form of the speech so produced.

361 "Joint speaking" is an umbrella term I have coined to cover all occasions in which the same

362 words are uttered by multiple people in unison (Cummins, 2013a). This includes many practices

363 of collective prayer, the chants of both protest demonstrators and sports fans, the recitations of

364 young school children, performances of choral speech, and the swearing of collective oaths in secular

365 contexts. To all these naturally occurring variants we can also add the simultaneous reading of

366 novel texts by pairs (or more) of speakers in the laboratory in a paradigm known as Synchronous

367 Speech (Cummins, 2003; Cummins, 2009).

368 This brief survey of situations in which people speak in unison makes it clear that this behavior

369 is very widespread, and is found in virtually every culture. It is thus a central, and not a peripheral,

370 example of languaging. With the exception of joint speaking in classrooms, which serves a multi-

371 tude of purposes imposed by educational authorities rather than expressing any sentiment of the

372 speakers, all of the naturally occurring forms of joint speech are found in situations in which the

373 attribution of collective, shared, intentionality seems to straightforwardly capture the significance

374 of the practice for participants. In prayer contexts, collective speaking testifies to shared beliefs.

3 My thanks to the anonymous reviewer who pointed out that the stark dichotomy between spoken and written texts has become considerably more complex.

375 In protest, the shared purposes of the crowd are made manifest through chanting. Among sports

376 fans, chants are a means by which collective identity is sustained and asserted. None of this is at all

377 surprising, nor in need of precise definition—at least, no more precise than seems warranted for the

378 attribution of beliefs, desires and intentions to individuals. While we may not all be enthusiastic

379 chanters, even a reluctance to join in such behavior testifies to the obligatory association of such

380 voicings with the underlying sentiments.

381 But if message passing does not illuminate such behavior, it seems fair to ask how we might

382 better characterize it; why are people engaging in such vocal activity, if not to pass ideas around?

383 While there is probably not a single answer to this question, a useful conceptual approach suggests

384 itself from the theory of speech acts (Austin, 1975). Austin noted that many utterances achieve

385 something simply by virtue of being spoken. Examples include "I pronounce you man and wife",

386 or "I apologize for my behavior". Such utterances he called "performatives". In the treatment pro-

387 vided by Austin, they are frequently signalled by such verbs as "pronounce", "decree", "promise",

388 etc. The set of performatives Austin alludes to, and the associated set of acts performed, is very

389 restricted. If there is merit to the idea that uttering gives rise to the complementary poles of

390 subject and world, then all utterances might properly be considered to be performatives, and the

391 establishment of a transient subject pole with an implicit intentional structure would then be an

392 achievement of the act of uttering. This approach to understanding joint speech helps to make

393 sense of some of its most reliable features. In what follows I will consider mainly the three most

394 common forms of joint speech4: collective prayer, protest chanting, and sports chanting.

395 All three forms of joint speech are frequently, almost inevitably, characterized by repetition:

396 the same phrase or short verse is repeated tens, or even hundreds of times over. Repetition makes

397 sense if the temporally bound act of utterance is required to establish and maintain a transient

398 subject pole with respect to which we can identify beliefs or intentions. Repetition is undergirded

399 by physical actions such as fist pumping, bead twiddling, or arm waving. While bead manipulation

400 is relatively private, the more macroscopic actions further serve to facilitate synchronization among

401 participants.

402 Repetition also serves to accentuate and exaggerate the rhythmic properties of utterances, while

403 repetition of a short phrase can also induce a change in perception from speech to song (Deutsch

404 et al., 2011). In repeated spoken chants, the form of speech that arises thus blends seamlessly

405 into the musical domain, establishing a continuity between speech and music. The close relation

406 between spoken and sung chant is signalled by the very ambiguous nature of the word "chant" in

407 English which applies with equal facility in either domain. It is interesting that a focus on collective

408 speech makes a continuum between speech and music appear natural, even obligatory, while the

409 message passing perspective as articulated most clearly by Pinker (1999) insists on an absolute

410 divide between the two domains. On the message-passing view, speech is an expression of the highly

411 valued notional faculty of language, and thus central to our human minds, while music is denigrated

412 as "auditory cheesecake", with no—from his perspective—apparent functional significance, thus

413 meriting being grouped together with artistic expression, cheesecake and pornography (Pinker,

414 1999). If anything illustrates the limited capacity to describe, or even see, that the message passing

415 perspective induces, surely it is this failure to appreciate the continuum we are all familiar with that

416 extends from instrumental music, through song, rap, poetry, rhymes, rhetoric, and chant (Cummins,

417 2013a). We might note in passing that the contrast drawn here between the real time participatory nature of

418 the voice and the frozen nature of writing finds a strong parallel

4 I hypothesize; I am not sure how one might measure relative frequency here.

419 in contemporary discussion of the relationship between live musical performance and recording

420 (Chanan, 1995).

421 We like to speak of the "wisdom of crowds", but the rather more familiar notion of the ignorance

422 of the mob, whose powers of reason are not to be trusted, is perhaps more apt for many of the

423 situations under consideration. While groups have frequently been found to outperform individuals

424 in tasks of judgement and estimation (Koriat, 2012), groups involved in joint speech of protest

425 are often found in volatile situations where collective actions are rudimentary and aggressive. It is

426 worth noting though that some degree of sophistication in the beliefs that are jointly articulated

427 is provided by the formal scaffold of call and response. The device of having a single leader call

428 a series of questions to which the crowd provides a series of responses is found in both prayer

429 and protest, though perhaps less so in sports chants. In prayer, this sequence of leading call and

430 collective response is often formalized into liturgical rites, allowing for a great deal of complexity in

431 the beliefs that are thereby expressed. In protest, it is far more common to see only a single call,

432 and a single response, and the very nature of protest militates against the kind of codification found

433 in ritual liturgical practices. Sports chanting seems to be more concerned with the demonstration

434 of collective identity than with the formulation of explicit statements of belief or intention, and

435 call-and-response chants are less common.

436 If we view writing as an elaboration of some aspects of speaking, i.e. a technological extrapola-

437 tion that gives rise to a formal system of the kind studied under the somewhat misleading label of

438 "language", then we might observe that vocal behavior, or languaging, appears to have other ex-

439 trapolations, other forms of extension, and other forms of codification, so that the formal constructs

440 of the linguists are not the only descendants of the voice. Collective speech has found integration

441 into rituals in a great diversity of traditions. The Abrahamic religions all formalize collective speak-

442 ing within their respective services, and in each of them the rituals integrate joint speaking into

443 a carefully orchestrated sequence of complementary acts by service leaders and participants that

444 include highly stylized sequences of movements such as bowing, kneeling, marching, etc. Other

445 religious traditions have engaged in similar forms of codification (Bell, 1988). Parallels between

446 linguistic grammar and ritual structure have previously been noted (Michaels et al., 2010), but the

447 principal point argued here is that voice has given rise to more than one species of formalization.

448 Liturgy and ritual do not admit of the same generative mutability as freely spoken or written text,

449 but by codifying such utterances in collective speech and ritual, the implicit intentional structure

450 that arises in speaking and performing, together with the associated belief structure, is stabilized.

451 With such observations, the boundaries of "language" become somewhat less determinate, and the

452 subjects that find voice become both more numerous and more varied.

453 4.1 Dynamic Entanglement in Synchronous Speaking

454 If the relation between voice and subjectivity put forward here has merit, joint speaking appears

455 as an extreme example that can serve to hone our considerations of the form and nature of col-

456 lective intentionality. In monologue, I alone dictate the intentional ground of my utterances; in

457 conversation, the shared ground is fluid and negotiated; in chanting it is immovable. Are there then

458 any signatures of joint intentionality that we can observe? In the spirit of the dynamical coupling

459 hypothesis of Fusaroli et al. (2014), we might look for evidence that joint speakers are strongly

460 coupled, giving rise to emergent phenomena at the supra-individual level.

461 In a series of behavioral studies in which speakers are asked to read novel texts in unison, no

462 major differences that would serve to pick out speech as collective based on its acoustic characteris-

463 tics alone have been observed (Cummins, 2014). Speech produced in these constrained laboratory

464 settings is remarkably unremarkable, and the technique of having subjects speak in synchrony has

465 been used as a device for obtaining unmarked speech in several phonetic studies (Krivokapic, 2007;

466 Kim and Nam, 2008; O'Dell et al., 2010; Dellwo and Friedrichs, 2012). The unmarked phonetic

467 structure of speech elicited in the synchronous speaking situation contrasts strongly with the ob-

468 servation that texts recited in ritual and rite are frequently, if not inevitably, highly stylized in

469 prosodic form. For example, consider the typical pattern with which the Hail Mary is said when

470 reciting the rosary, or, in a secular context, the characteristic form of the Pledge of Allegiance

471 as recited by American schoolchildren. Prosodic stylization thus appears as a reliable, but not

472 necessary, characteristic of joint speech.

473 There is one form of speech error found in a synchronous speech task that seems to be unique to

474 that situation, and that illustrates a strong dynamic coupling between speakers. When one speaker

475 makes a speech error, it is frequently, though by no means always, observed that both speakers

476 stop speaking simultaneously. Sometimes this abrupt cessation can even be in mid-syllable. Abrupt

477 and simultaneous cessation of speech seems to be unique to this situation, and I have previously

478 compared it to the collective tumbling that happens so readily in a three-legged race if either

479 participant makes a misstep (Cummins et al., 2013). This seems to suggest that the task of

480 synchronizing leads to a close intertwining of the process of speech production by each speaker,

481 leaving each vulnerable to mistakes by the other. This observation might be tempered, however,

482 by noting that the degree of synchronization found in the laboratory is typically much greater than

483 that found in the wild, where relatively loose temporal alignment is common and tolerated.

484 A second source of empirical phenomena associated with joint speaking comes from an fMRI

485 study by Jasmin and co-workers (Jasmin et al., in preparation), in which subjects spoke prepared

486 sentences in a variety of conditions, including speaking alone, listening, speaking in synchrony with

487 the experimenter and speaking in synchrony with a recording of the experimenter. Importantly,

488 subjects were not informed of the difference between the latter two conditions, and on debriefing,

489 they were never aware that recordings were used at all. In contrasting the regional blood flow

490 subsequent to speaking in the latter two synchronization conditions, a marked difference was found

491 in macroscopic patterns of cortical activity, despite the obliviousness of subjects to the contrast.

492 In particular, synchronization with a live person was characterized by an increase in activity in

493 right hemisphere locations, including the temporal pole, supramarginal gyrus, superior temporal

494 gyrus and the right hemisphere homologue of Broca's area—the latter three are areas that, in the

495 left hemisphere, are reliably implicated in speech production activity. There is thus a large scale

496 alteration to the well-known hemispheric asymmetry that attends speech production, but only when

497 the speaker is coupled in real time to another speaker, and not when the non-self voice has the

498 inflexibility of a recording.

499 5 Voice, (Inter-)Subjectivity, and Real Time Recurrent Interaction

500 As scientists, we need to acknowledge that the metaphysical background within which one

501 works makes some inquiries possible, and some impossible. For all the acknowledged successes

502 of the message passing view of language rooted in a Cartesian framework, there are very many

503 familiar phenomena that have been passed over, or, at best, relegated to the outer wastelands of

504 the non-cognitive and non-linguistic. I have here sought to work with a notion of the subject that

505 is an emergent property of specific kinds of interpersonal interaction rooted in real time reciprocal

506 exchange. This unconventional view of the subject brings with it a very different view of what

507 language is, to the point where the formal system described by modern linguistics

508 no longer appears to be describing the human capacity to create shared perspective, to generate

509 a shared common ground, and to bring forth a common world. Where received approaches to

510 "language" treat of regularities found in sequences of symbols, I have focussed on the voice, uttered

511 from a specific concerned perspective, and necessarily tied to the real time negotiation of a subjective

512 pole. In the voice, we find a strong index of intentionality, but an intentionality that shifts, that

513 arises fluidly, that is sometimes grounded in an individual, sometimes in a negotiated context, and

514 that sometimes seems to emerge at the collective level in a manner no longer reducible to the

515 thoughts, beliefs, and perspectives of the contributing individuals (Carr, 1987). This dissociation

516 of the voiced subject from the solipsistic individual is seen perhaps most clearly in the case of

517 joint speech. The emphasis on voice and intentionality serves to position the symbolic domain

518 of structural and generative linguistics as a specific, limited, extrapolation and codification of an

519 older practice of uttering that has given rise to several distinct extensions and codifications in such

520 domains as ritual and rite.

521 The loosening of metaphysical commitments that results when we abandon the Cartesian subject

522 offers the opportunity to reconsider many phenomena, and joint speech provides an important and

523 familiar case in point. The practice of joint speech is not restricted to any particular culture. As

524 well as being ubiquitous, it is immediately apparent that the situations in which people speak

525 collectively do not form an arbitrary or incoherent set. All such situations seem to provide strong

526 evidence of collectively held beliefs, and it is through the collective voicing that this attribution

527 becomes warranted. It might help here to note that the subjectivity being treated so rudely is not

528 coextensive with the mind of an individual, nor with the idea of a cognitive system, conceived of as a

529 set of sub-personal information processing mechanisms that some hypothesize to underlie observed

530 behavior. The subject pole referred to here is an aggregate to whom it makes sense to attribute

531 a limited range of intentions, and in particular, beliefs. I am thus wielding the term "belief" here

532 in a sense rather like the dispositional account provided by Ryle (1949). This flexible notion of

533 the subject seems to work when applied to an individual, a conversing dyad, or a lynch mob, each

534 of whom can be said to speak from a distinct position, with a specific perspective. In strenuously

535 avoiding the Cartesian split between mind and world, we would do well to avoid adopting an overly

536 rigid metaphysical position. Rather, if subjects admit of the kind of treatment proposed here,

537 then an ontological lightness of touch that can encompass many kinds of intentional subjects seems

538 warranted.

539 The empirical phenomena described above strongly highlight the importance of real time dy-

540 namic interaction among people in generating the subject-pole to which beliefs can sensibly be

541 attributed. The neural signature of collective speaking is found when speaking with a live speaker,

542 but not with a recording (Jasmin et al., in preparation). Live conversational partners become entan-

543 gled not only in ways that fit a linguistic description (lexical priming, syntactic biasing, phonological

544 and phonetic imitation; Pickering and Garrod, 2004), but in a host of subtle ways that have hitherto

545 been treated as non-linguistic. These include gaze, posture, gestures, and blinks, but this set

546 might conceivably be considerably extended as researchers turn their attention more and more to

547 physiological markers of interaction (Campbell, 2007; Richardson et al., 2007; Shockley et al., 2009;

548 Cummins, 2012; Wagner et al., 2014). The voice is an important part of the means by which a

549 collective perspective is established and maintained, but it is one among many. The interaction

550 of voice and gaze may play a particularly strong role in allowing the protracted sustainment of

551 conditions of joint attention, which appears as a possible foundation for the shared intentionality

552 required to ground a human cultural world (Tomasello et al., 2005)5.

553 The dynamic entanglement seen in conversation, and in joint speech, can be empirically de-

554 scribed as a form of mutual coordination, whereby two or more participants display a transient

555 inter-dependence on many levels (Shockley et al., 2009; Fusaroli et al., 2014). This third-person

556 account lends itself well to ethological and experimental observation and modelling. A well-worked

557 mathematical framework for describing how autonomous systems that interact in real time can give

558 rise to emergent phenomena at the collective level is available, e.g. as illustrated by the field of

559 coordination dynamics (Kelso, 1995; Oullier and Kelso, 2009). Social cognitive neuroscience has

560 recently begun to recognize that nervous systems of interacting individuals behave quite differently

561 from those of solitary subjects, and often become inter-dependent (Hari and Kujala, 2009; Babiloni

562 and Astolfi, 2012; Schilbach et al., 2013). This opens up a vast empirical research agenda for the

563 future.
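
As a purely illustrative sketch, and not a model taken from any of the studies cited above, the kind of mutual entrainment that coordination dynamics describes can be shown with two phase oscillators that each run at their own natural frequency and are weakly pulled toward one another. The sinusoidal (Kuramoto-type) coupling, the function name simulate_phase_difference, and all parameter values are assumptions chosen for illustration only; nothing here is fitted to speech data.

```python
import math


def simulate_phase_difference(omega_a=1.0, omega_b=1.2, coupling=0.5,
                              dt=0.01, steps=5000):
    """Euler-integrate two mutually coupled phase oscillators.

    Each oscillator advances at its own natural frequency (rad/s) and is
    pulled toward the other by a sinusoidal coupling term. Returns the
    phase difference (b minus a) recorded after every step.
    """
    phase_a, phase_b = 0.0, 1.0  # arbitrary, hypothetical starting phases
    differences = []
    for _ in range(steps):
        # Both rates are computed from the current phases before updating.
        rate_a = omega_a + coupling * math.sin(phase_b - phase_a)
        rate_b = omega_b + coupling * math.sin(phase_a - phase_b)
        phase_a += rate_a * dt
        phase_b += rate_b * dt
        differences.append(phase_b - phase_a)
    return differences


if __name__ == "__main__":
    coupled = simulate_phase_difference(coupling=0.5)
    uncoupled = simulate_phase_difference(coupling=0.0)
    print("coupled, final phase difference:  ", round(coupled[-1], 3))
    print("uncoupled, final phase difference:", round(uncoupled[-1], 3))
```

Under these assumed values the coupled pair settles to a constant phase difference, a stable relation that belongs to neither oscillator alone, whereas the uncoupled pair drifts apart steadily. The sketch is meant only to make the notion of an emergent, collective-level regularity concrete.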

564 But the shifting ground of subjectivity that is here espoused poses challenges for description

565 from a phenomenological or experiential point of view. Here, the recent concept of participatory

566 sense-making may be of assistance (De Jaegher and Di Paolo, 2007; Fuchs and De Jaegher, 2009).

567 Participatory sense-making extrapolates from the basic enactive account that grounds sense-making

568 (perception/action in the service of the generation of meaning) in the adaptive interaction of an

569 autonomous agent with its environment (Froese and Di Paolo, 2011). Building on this perspective,

570 participatory sense-making describes how the moment-to-moment interaction of two subjects gives

571 rise to a mutuality in their joint sense-making, allowing for the joint creation of meaning. On

572 this account, the emergent domain constituted by the inter-dependent activities of two or more

573 subjects warrants treatment as a phenomenological domain in its own right (Cummins, 2013b).

574 Intersubjectivity then is the enactment of a novel phenomenological domain in the sustained, real

575 time coordinated activities of two or more people. There appears to be a convergence of the

576 theoretical vocabulary and the demands raised by empirical studies that bodes well for further

577 scientific work.

578 A host of open questions relate to the role of clock time and synchronized behavior. In collective

579 speaking, we observe highly coordinated action that relies, not on a common external beat or

580 timekeeper, but on shared knowledge among interactants. Highly synchronized behavior that is

581 scaffolded by an external beat is also very common, as in music making, marching, or dancing, but

582 this kind of collective entrainment does not seem to bring with it an automatic sense of commitment

583 to underlying beliefs or intentions. We are all familiar with western school kids dancing happily to

584 the religiously tinged beats of Bob Marley, without worrying about whether they really subscribe

585 to the tenets of Rastafarianism. Much work remains to be done in gaining a better understanding

586 of how collective coordinated behavior gives rise to collective intentionality, and what the necessary

587 preconditions for that in the contributing individuals are.

588 A willingness to countenance subjective poles that are not co-extensive with the individual

589 person, and that rise and fade in a dynamic fashion, is incompatible with the grounding assumptions

590 of much of conventional psychology. Of course, psychology itself has grappled since its inception

591 with the boundaries of the subject (Dewey, 1896). One way of describing the subject matter of

592 psychology is with reference to the twin poles of experience and behavior, for which a causal account

593 is sought. This approach looks out at the world from a subject whose existence, persistence, and

5 Small wonder then that the appearance of "language" seems utterly mysterious from the vantage point of modern linguistics (Hauser et al., 2014). The discipline has defined its own subject almost out of existence.

594 integrity is taken for granted. The approach taken here, and enabled by the enactive framework

595 more generally, is to reverse the direction of inquiry, from a view towards experience (whose?) and

596 behavior (by whom?), and to look instead at the shifting referents of the personal pronouns "I",

597 "we", "you", etc. It is here that it becomes apparent that the received view of language will not

598 serve, any more than the notion of a solipsistic mind. Of course the contemporary scientific view

599 of language is deeply rooted in a specific set of psychological commitments, and a view of mind as

600 information processing, that together gave birth to the cognitivist worldview. Adopting a different

601 stance with respect to the ground of experience must, it seems, go hand in hand with a willingness

602 to question the boundaries that have traditionally served to demarcate the linguistic domain. This

603 opens up the enticing prospect that we might begin to question, negotiate, and re-evaluate just

604 what, and who, "we" think "we" are.

605 In Seeger (2004), an account is provided of the way music and song are integrated into the lives

606 of the Suya people of the Amazon basin. Some songs, the shout songs, are sung from what we

607 might consider a conventional egocentric perspective. Others are sung in unison. Of these Seeger

608 notes:

609 The Suya men said they sang shout songs for their sisters . . . When I asked them

610 for whom they sang unison songs, they responded that they simply sang them. They

611 weren't for anyone. A man did not sing a unison song as a brother, lover, or individual.

612 He sang it as a member of a group, whose identity was partly established through the

613 song. Thus they sang for a general audience: the act of singing was the statement. In

614 some sense, invocations had no audience at all . . . (Seeger, 2004, p. 83)

615 Acknowledgements

616 I am indebted to three anonymous reviewers who provided very thoughtful feedback which improved

617 the present contribution.

618 References

619 Aristotle (1986). De Anima (On the Soul). Penguin UK.

620 Austin, J. L. (1975). How to do Things with Words. Oxford University Press.

621 Auvray, M., Lenay, C., and Stewart, J. (2009). Perceptual interactions in a minimalist virtual environment.

622 New Ideas in Psychology, 27(1):32-47.

623 Babiloni, F. and Astolfi, L. (2012). Social neuroscience and hyperscanning techniques: past, present and

624 future. Neuroscience & Biobehavioral Reviews.

625 Bell, C. (1988). Ritualization of texts and textualization of ritual in the codification of Taoist liturgy. History

626 of Religions, pages 366-392.

627 Bottineau, D. (2010). Language and enaction. In Stewart, J. R., Gapenne, O., and Di Paolo, E. A.,

628 editors, Enaction: Toward a new paradigm for cognitive science, page 267. MIT Press.

629 Campbell, N. (2007). On the use of nonverbal speech sounds in human communication. In Verbal and

630 Nonverbal Communication Behaviours, pages 117-128. Springer.

631 Carr, D. (1987). Cogitamus ergo sumus: The intentionality of the first-person plural. In Interpreting Husserl,

632 pages 281-296. Springer.

633 Chanan, M. (1995). Repeated Takes: A Short History of Recording and its Effects on Music. Verso.

634 Clark, H. H. and Brennan, S. E. (1991). Grounding in communication. In Perspectives on Socially Shared

635 Cognition, pages 127-149. American Psychological Association.

636 Connor, S. (2000). Dumbstruck: A Cultural History of Ventriloquism. Oxford University Press, Oxford.

637 Cowley, S. J. and Love, N. (2006). Language and cognition, or, how to avoid the conduit metaphor. In

638 Bridges and Walls in Metalinguistic Discourse, pages 135-154. Peter Lang, Frankfurt, DE.

639 Cummins, F. (2003). Practice and performance in speech produced synchronously. Journal of Phonetics,

640 31(2):139-148.

641 Cummins, F. (2009). Rhythm as entrainment: The case of synchronous speech. Journal of Phonetics,

642 37(1):16-28.

643 Cummins, F. (2012). Gaze and blinking in dyadic conversation: A study in coordinated behaviour among

644 individuals. Language and Cognitive Processes, 27(10):1525-1549.

645 Cummins, F. (2013a). Joint speech: The missing link between speech and music? Percepta—Revista de

646 Cognigao Musical, 1(1):17-32.

647 Cummins, F. (2013b). Towards an enactive account of action: Speaking and joint speaking as exemplary

648 domains. Adaptive Behavior, 21(3):178-186.

649 Cummins, F. (2014). The remarkable unremarkableness of joint speech. In Proceedings of the 10th Interna-

650 tional Seminar on Speech Production, pages 73-77, Cologne, DE.

651 Cummins, F., Li, C., and Wang, B. (2013). Coupling among speakers during synchronous speaking in English

652 and Mandarin. Journal of Phonetics, 41(6):432-441.

653 Dale, R., Fusaroli, R., Duran, N., and Richardson, D. C. (2013). The self-organization of human interaction.

654 Psychology of Learning and Motivation, 59:43-95.

655 Daston, L. J. and Galison, P. (2007). Objectivity. Zone Books, New York.

656 De Jaegher, H. and Di Paolo, E. (2007). Participatory sense-making: An enactive approach to social

657 cognition. Phenomenology and the Cognitive Sciences, 6(4):485-507.

658 Dellwo, V. and Friedrichs, D. (2012). Variability of speech rhythm in synchronous speech. In Proceedings of

659 Speech Prosody 2012, V. 2, pages 539-542.

660 Deutsch, D., Henthorn, T., and Lapidis, R. (2011). Illusory transformation from speech to song. Journal of

661 the Acoustical Society of America,, 129(4):2245-2252.

662 Dewey, J. (1896). The reflex arc concept in psychology. Psychological Review, 3(4):357.

663 Everett, D. L. (2009a). Don't Sleep, There are Snakes: Life and Language in the Amazonian Jungle. Random

664 House LLC.

665 Everett, D. L. (2009b). Piraha culture and grammar: A response to some criticisms. Language, 85(2):405-

666 442.

667 Fodor, J. A. (1975). The Language of Thought. Harvard University Press.

668 Froese, T. and Di Paolo, E. (2011). The enactive approach: Theoretical sketches from cell to society.

669 Pragmatics & Cognition, 19(1).

670 Froese, T., Iizuka, H., and Ikegami, T. (2014). Embodied social interaction constitutes social cognition in

671 pairs of humans: A minimalist virtual reality experiment. Scientific Reports, 4.

672 Fuchs, T. and De Jaegher, H. (2009). Enactive intersubjectivity: Participatory sense-making and mutual

673 incorporation. Phenomenology and the Cognitive Sciences, 8(4):465-486.

674 Fusaroli, R., Raczaszek-Leonardi, J., and Tylen, K. (2014). Dialog as interpersonal synergy. New Ideas in

675 Psychology, 32:147-157.

676 Fusaroli, R. and Tylen, K. (2012). Carving language for social coordination: A dynamical approach. Inter-

677 action Studies, 13(1):103-124.

678 Hari, R. and Kujala, M. V. (2009). Brain basis of human social interaction: from concepts to brain imaging.

679 Physiological Reviews, 89(2):453-479.

680 Hauser, M. D., Chomsky, N., and Tecumseh Fitch, W. (2002). The faculty of language: what it is. who has

681 it, and how did it evolve? Science, 298:1569-1579.

682 Hauser, M. D., Yang, C., Berwick, R. C., Tattersall, I., Ryan, M., Watumull, J., Chomsky, N., and Lewontin,

683 R. (2014). The mystery of language evolution. Frontiers in Psychology: Language Sciences, 5:401.

684 Huang, C.-T. J. (1984). On the distribution and reference of empty pronouns. Linguistic Inquiry, pages

685 531-574.

686 Hutto, D. D. and Myin, E. (2013). Radicalizing Enactivism: Basic Minds Without Content. MIT Press.

687 Jasmin, K. J., McGettigan, C., Agnew, A. K., Josephs, O., Cummins, F., and Scott, S. K. Speaking together:

688 Blurring the boundary between self and other. Manuscript in preparation, May 2014.

689 Kelso, J. A. S. (1995). Dynamic Patterns. MIT Press, Cambridge, MA.

690 Kim, M. and Nam, H. (2008). Synchronous speech and speech rate. Journal of the Acoustical Society of

691 America,, 123(5):3736.

692 Koriat, A. (2012). When are two heads better than one and why? Science, 336(6079):360-362.

693 Krivokapic, J. (2007). Prosodic planning: Effects of phrasal length and complexity on pause duration.

694 Journal of Phonetics, 35(2):162-179.

695 Latash, M. (2008). Synergy. Oxford University Press, USA.

696 Latour, B. (2013). An Inquiry into Modes of Existence. Harvard University Press.

697 Linell, P. (2005). Written language bias in linguistics. Routledge.

698 McLuhan, M. (1964). Understanding Media: The Extensions of Mann. McGraw-Hill.

699 Michaels, A., Mishra, A., Dolce, L., Raz, G., and Triplett, K. (2010). Grammars and Morphologies of Ritual

700 Practices in Asia. Harrassowitz Verlag.

701 Murray, L. and Trevarthen, C. (1986). The infant's role in mother-infant communications. Journal of Child

702 Language, 13(1):15-29.

703 O'Dell, M., Nieminen, T., and Mustanoja, L. (2010). Assessing rhythmic differences with synchronous

704 speech. In Speech Prosody 2010-Fifth International Conference, volume 100141, pages 1-4.

705 Okamoto-Barth, S., Call, J., and Tomasello, M. (2007). Great apes' understanding of other individuals' line

706 of sight. Psychological Science, 18(5):462-468.

707 Olson, D. R. (1996). The World on Paper. Cambridge University Press.

708 Ong, W. (1982). Orality and Literacy: The Technologizing of the Word. Methuen & Co., London.

709 Oullier, O. and Kelso, J. A. S. (2009). Social coordination, from the perspective of coordination dynamics.

710 In Encyclopedia of Complexity and Systems Science, pages 8198-8213. Springer.

711 Pickering, M. J. and Garrod, S. (2004). Toward a mechanistic psychology of dialogue. Behavioral and Brain

712 Sciences, 27(02):169-190.

713 Pickering, M. J. and Garrod, S. (2014). Self-, other-, and joint monitoring using forward models. Frontiers

714 in Human Neuroscience, 8.

715 Pinker, S. (1999). How the mind works. Annals of the New York Academy of Sciences, 882(1):119-127.

716 Richardson, D. C., Dale, R., and Kirkham, N. Z. (2007). The art of conversation is coordination: Common

717 ground and the coupling of eye movements during dialogue. Psychological Science, 18(5):407-413.

718 Riley, M. A., Richardson, M. J., Shockley, K., and Ramenzoni, V. C. (2011). Interpersonal synergies.

719 Frontiers in psychology, 2.

720 Ryle, G. (1949). The Concept of Mind. Barnes & Noble.

721 Saussure, F. d. (1959/1916). Course in General Linguistics. Philosophical Library, New York, NY.

722 Schegloff, E. A. (1991). Conversation analysis and socially shared cognition. In Resnick, L. B., Levine, J., and

723 Behrend, S. D., editors, Socially Shared Cognition, pages 150-171. American Psychological Association.

724 Schilbach, L., Timmermans, B., Reddy, V., Costall, A., Bente, G., Schlicht, T., and Vogeley, K. (2013).

725 Toward a second-person neuroscience. Behavioral and Brain Sciences, 36(04):393-414.

726 Seeger, A. (2004). Why Suya sing: A Musical Anthropology of an Amazonian People. University of Illinois

727 Press.

728 Sheets-Johnstone, M. (1999). Emotion and movement. a beginning empirical-phenomenological analysis of

729 their relationship. Journal of Consciousness Studies, 6(11-12):11-12.

730 Shockley, K., Richardson, D. C., and Dale, R. (2009). Conversation and coordinative structures. Topics in

731 Cognitive Science, 1(2):305-319.

732 Stewart, J. (2010). Foundational issues in enaction as a paradigm for cognitive science: From the origin

733 of life to consciousness and writing. In Stewart, J. R., Gapenne, O., and Di Paolo, E. A., editors,

734 Enaction: Toward a New Paradigm for Cognitive Science, pages 1-31. MIT Press.

735 Thibault, P. J. (2011). First-order languaging dynamics and second-order language: the distributed language

736 view. Ecological Psychology, 23(3):210-245.

737 Tomasello, M., Carpenter, M., Call, J., Behne, T., and Moll, H. (2005). Understanding and sharing intentions:

738 The origins of cultural cognition. Behavioral and Brain Sciences, 28(05):675-691.

739 Tomasello, M. and Farrar, M. J. (1986). Joint attention and early language. Child Development, pages

740 1454-1463.

741 Tomasello, M., Hare, B., Lehmann, H., and Call, J. (2007). Reliance on head versus eyes in the gaze

742 following of great apes and human infants: the cooperative eye hypothesis. Journal of Human Evolution,

743 52(3):314-320.

744 Volta Laboratory (2013). New optical scan results from the Smithsonian Volta Laboratory Collection.

745 Catalogue No. 312123,

746 Vygotsky, L. (1986). Thought and Language. MIT Press, Cambridge, MA.

747 Wagner, P., Malisz, Z., and Kopp, S. (2014). Gesture and speech in interaction: An overview. Speech

748 Communication, 57:209-232.
