text to speech Archives - reactive music

synthesizing voices (formant synthesis, text to speech, Vocaloid)
processing voices (pitch-shifting, time-stretching, vocoding, filtering, harmonizing)
voices of the natural world
fictional languages and animals
accents
speech and music recognition
processing voices as pictures
removing music from speech
removing voices

Voices

We instantly recognize people and animals by their voices. As artists, we work to develop our own voices. Voices contain information beyond words. Think of R2D2 or Chewbacca.

There is also information between words: “Palin Biden Silences” David Tinapple, 2008: http://vimeo.com/38876967

Synthesizing voices

The vocal spectrum

What’s in a voice?

Formant synthesis in Max by Mark Durham: https://reactivemusic.net/?p=9294 (singing vowels with formants)
Formant synthesis Tutorial by Jordan Smith: https://reactivemusic.net/?p=9290 (making consonants with noise)
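
A minimal sketch of singing a vowel with formants, using additive synthesis: harmonics of a fundamental are summed, and each harmonic is weighted by how close it sits to a formant peak. The formant frequencies and bandwidths for "ah" are rough illustrative values (an assumption, not taken from the patches linked above), and the bell-shaped resonance curve is a simplification.

```python
import math

# Rough (centre Hz, bandwidth Hz) pairs for an "ah" vowel.
# Illustrative values only; real voices vary a lot.
AH_FORMANTS = [(700.0, 130.0), (1200.0, 70.0), (2600.0, 160.0)]


def formant_gain(freq, formants=AH_FORMANTS):
    """Sum of bell-shaped resonance curves evaluated at freq."""
    return sum(math.exp(-0.5 * ((freq - fc) / bw) ** 2) for fc, bw in formants)


def sing_vowel(f0=110.0, formants=AH_FORMANTS, seconds=0.5, sr=8000):
    """Additive synthesis: every harmonic of f0 below Nyquist, shaped by the formants."""
    harmonics = []
    k = 1
    while k * f0 < sr / 2:
        harmonics.append((k * f0, formant_gain(k * f0, formants)))
        k += 1
    n = int(seconds * sr)
    out = []
    for i in range(n):
        t = i / sr
        out.append(sum(g * math.sin(2 * math.pi * f * t) for f, g in harmonics))
    peak = max(abs(x) for x in out) or 1.0
    return [x / peak for x in out]  # normalized to the range -1..1
```

Changing f0 changes the sung pitch while the formants (the vowel) stay put. Changing the formant list changes the vowel while the pitch stays put.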

Singing chords

Humans acting like synthesizers.

Singing chords: Lalah Hathaway https://www.youtube.com/watch?v=c5AdOZtRdfE (0:30)
Tuvan throat singing: https://www.youtube.com/watch?v=5wHbIWH_NGc (near the end of the video)
Polyphonic overtone singing: Anna-Maria Hefele https://www.youtube.com/watch?v=vC9Qh709gas

More about formants

Formants (Wikipedia) http://en.wikipedia.org/wiki/Formant
Rooms have resonances: “I am sitting in a Room” by Alvin Lucier
Singer’s formant (2800-3400Hz).

Text to speech

Teaching machines to talk.

phonemes (unit of sound)
diphones (combination of phonemes) (Mac OS “Macintalk 3 pro”)
morphemes (unit of meaning)
prosody (musical quality of speech)

Methods

articulatory (anatomical model)
formant (additive synthesis) (speak and spell)
concatenative (building blocks) (Mac OS)

Try the ‘say’ command (in Mac OS terminal), for example: say hello
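
The 'say' command can also be driven from a script. This is a small sketch that builds the command line and runs it only if 'say' is present (Mac OS). The -v (voice) and -r (rate) options belong to the 'say' tool. The helper names here are hypothetical.

```python
import shutil
import subprocess


def build_say_command(text, voice=None, rate=None):
    """Build the argument list for the Mac OS 'say' command."""
    cmd = ["say"]
    if voice:
        cmd += ["-v", voice]      # pick a named system voice
    if rate:
        cmd += ["-r", str(rate)]  # speaking rate in words per minute
    cmd.append(text)
    return cmd


def speak(text, voice=None, rate=None):
    """Speak the text if 'say' is available; return False otherwise."""
    if shutil.which("say") is None:
        return False
    subprocess.run(build_say_command(text, voice, rate), check=True)
    return True
```

For example, speak("hello") does the same thing as typing say hello in the terminal. Any voice name passed in is assumed to be installed on the machine.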

More about text to speech

History of speech synthesis http://research.spa.aalto.fi/publications/theses/lemmetty_mst/chap5.html (Helsinki University of Technology 1999)
Speech synthesizers, 2014 https://reactivemusic.net/?p=18141
Speech synthesis web API https://reactivemusic.net/?p=18138

Vocoders

Combining the energy of voice with musical instruments (convolution)

Peter Frampton “talkbox”: https://www.youtube.com/watch?v=EqYDQPN_nXQ (about 5:42) – Where is the exciting audience noise in this video?
Ableton Live example: Local file: Max/MSP: examples/effects/classic-vocoder-folder/classic_vocoder.maxpat
Max vocoder tutorial (In the frequency domain), by dude837 – Sam Tarakajian https://reactivemusic.net/?p=17362 (local file: dude837/4-vocoder/robot-master.maxpat)

More about vocoders

How vocoders work, by Craig Anderton: https://reactivemusic.net/?p=17218
Wikipedia: http://en.wikipedia.org/wiki/Vocoder – engineers conserving information to reduce bandwidth
Heterodyne filter: https://reactivemusic.net/?p=17338 – digital emulation of an analog filter bank.
Max/MSP: examples/effects/classic-vocoder-folder/classic_vocoder.maxpat

Vocaloid

By Yamaha

(text + notation = singing)

Vocaloid website: http://www.vocaloid.com/en/
Hatsune Miku: https://reactivemusic.net/?p=6891

Demo tracks: https://www.youtube.com/watch?v=QWkHypp3kuQ

Vocaloid tutorial
- #1 https://www.youtube.com/watch?v=vcJDTDBWTrw (entering notes and lyrics – 1:25)
- #2 https://www.youtube.com/watch?v=qpGwgIyMGOk (raw sound – 0:42)
- #5 https://www.youtube.com/watch?v=YEAuL6Q2j-0 (with phrasing, vibrato, etc. – 1:00)

Vocaloop device http://vocaloop.jp/ demo: https://www.youtube.com/watch?v=xLpX2M7I6og#t=24

Processing voices

Transformation

Pitch transposing a baby https://reactivemusic.net/?p=2458

Real time pitch shifting

Autotune: “T-Pain effect” (I-am-T-Pain by Smule), “Lollipop” by Lil Wayne, “Woods” by Bon Iver https://www.youtube.com/watch?v=1_cePGP6lbU

Autotuna in Max 7

by Matthew Davidson

Local file: max-teaching-examples/autotuna-test.maxpat

InstantDecomposer in Pure Data (Pd)

by Katja Vetter

http://www.katjaas.nl/slicejockey/slicejockey.html

Autocorrelation: (helmholtz~ Pd external) “Helmholtz finds the pitch” http://www.katjaas.nl/helmholtz/helmholtz.html

(^^ is input pitch, preset #9 is normal)

local file: InstantDecomposer version: tkzic/pdweekend2014/IDecTouch/IDecTouch.pd
local file: slicejockey2test2/slicejockey2test2.pd
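
Finding pitch by autocorrelation can be sketched in a few lines: slide a copy of the signal against itself and look for the lag where the two line up best. This is plain autocorrelation, not the normalized method used inside helmholtz~, and the function name and default frequency range are assumptions for illustration.

```python
import math


def estimate_pitch(samples, sr, fmin=80.0, fmax=1000.0):
    """Estimate the fundamental frequency (Hz) with plain autocorrelation.

    Returns 0.0 when no positive correlation peak is found (e.g. silence).
    """
    min_lag = max(1, int(sr / fmax))
    max_lag = min(int(sr / fmin), len(samples) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Correlate the signal with a copy of itself delayed by 'lag' samples.
        score = sum(samples[i] * samples[i + lag] for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sr / best_lag if best_lag else 0.0
```

The best lag is the period of the sound in samples, so the pitch is the sample rate divided by that lag.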

Phasors and Granular synthesis

Disassembling time into very small pieces

Sorting noise: http://youtu.be/kPRA0W1kECg
Phasors: https://reactivemusic.net/?p=17353

Time-stretching

Adapted from Andy Farnell, “Designing Sound”

https://reactivemusic.net/?p=11385 Download these patches from: https://github.com/tkzic/max-projects folder: granular-timestretch

Basic granular synthesis: graintest3.maxpat
Time-stretching: timestretch5.maxpat
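
A rough sketch of granular time-stretching, assuming a simple overlap-add scheme rather than the phasor-driven design in the patches above: short windowed grains are copied from the input, and the read position advances more slowly than the write position. Grain size, hop, and the function name are arbitrary choices for illustration.

```python
import math


def time_stretch(samples, stretch=2.0, grain=256, hop=64):
    """Granular time-stretch. stretch > 1 makes the sound longer at the same pitch."""
    # Hann window so that each grain fades in and out.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / grain) for i in range(grain)]
    out_len = int(len(samples) * stretch)
    out = [0.0] * (out_len + grain)
    norm = [0.0] * (out_len + grain)
    out_pos = 0
    while out_pos < out_len:
        in_pos = int(out_pos / stretch)  # read head moves slower than write head
        for i in range(grain):
            if in_pos + i >= len(samples):
                break
            out[out_pos + i] += samples[in_pos + i] * window[i]
            norm[out_pos + i] += window[i]
        out_pos += hop
    # Divide by the summed window gain so overlapping grains do not change the level.
    return [o / g if g > 1e-9 else 0.0 for o, g in zip(out[:out_len], norm[:out_len])]
```

Because each grain is played back at its original speed, pitch is roughly preserved. A naive version like this will still produce audible artifacts where grains overlap out of phase.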

More about phasors and granular synthesis

Shepard tone upward glissando by Chris Dobrian: https://reactivemusic.net/?p=17255
“Falling Falling” (Visual Shepard tone) https://reactivemusic.net/?p=17251
Ableton Live – granulator (Robert Henke)

Phase vocoder

…coming soon

Sonographic sound processing

Changing sound into pictures and back into sound

by Tadej Droljc

https://reactivemusic.net/?p=16887

(Example of 3d speech processing at 4:12)

local file: SSP-dissertation/4 – Max/MSP/Jitter Patch of PV With Spectrogram as a Spectral Data Storage and User Interface/basic_patch.maxpat

Try recording a short passage, then set bound mode to 4, and click autorotate

Speech to text

Understanding the meaning of speech

The Google Speech API

A conversation with a robot in Max

https://reactivemusic.net/?p=9834

Google speech uses neural networks, statistics, and large quantities of data.

More about speech to text

Real time German/English translator (Microsoft) http://digg.com/video/heres-microsoft-demoing-their-breakthrough-in-real-time-translated-conversation
Skype translator – Spanish/English: http://www.skype.com/en/translator-preview/
Dragon Naturally Speaking (Nuance) accidentally converts music to poetry

Voices of the natural world

Changes in the environment reflected by sound

Bernie Krause: “Soundscapes”
- The Voice of The Natural World: http://blog.ted.com/2013/06/12/the-voice-of-the-natural-world-bernie-krause-at-tedglobal-2013/
- TED: http://www.ted.com/talks/bernie_krause_the_voice_of_the_natural_world

Fictional languages and animals

“You can talk to the animals…”

Derek Abbott’s animal noise page: http://www.eleceng.adelaide.edu.au/Personal/dabbott/animal.html
Quack project http://www.quack-project.com/table.cgi
Fictional language dialog by Naila Burney: https://reactivemusic.net/?p=7242

Pig creatures example: http://vimeo.com/64543087

0:00 Neutral
0:32 Single morphemes – neutral mode
0:37 Series, with unifying sounds and breaths
1:02 Neutral, layered
1:12 Sad
1:26 Angry
1:44 More Angry
2:11 Happy

What about Jar Jar Binks?

Accents

The sound changes but the words remain the same.

The Speech accent archive https://reactivemusic.net/?p=9436

Finding and removing music in speech

We are always singing.

Jamming with speech

Drummer jams with a speed-talking auctioneer: https://reactivemusic.net/?p=7140
Guitarist imitates crying politician: http://digg.com/video/guitarist-plays-along-to-sobbing-japanese-politician

Removing music from speech

SMS-tools

by Xavier Serra and UPF

Harmonic Model Plus Residual (HPR) – Build a spectrogram using STFT, then identify where there is strong correlation to a tonal harmonic structure (music). This is the harmonic model of the sound. Subtract it from the original spectrogram to get the residual (noise).

Settings for above example:

Window size: 1800 (SR / f0 * lobeWidth) 44100 / 200 * 8 = 1764
FFT size: 2048
Mag threshold: -90
Max harmonics: 30
f0 min: 150
f0 max: 200
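
The window-size arithmetic and the harmonic/residual split can be sketched together. This is a toy, single-frame illustration and not the SMS-tools code: it assumes the fundamental is already known, uses a slow textbook DFT, and marks bins near each harmonic with a hard mask instead of detecting and synthesizing sinusoidal peaks. All function names are hypothetical.

```python
import cmath
import math


def analysis_window_size(sample_rate, f0, lobe_width):
    """Window size from the rule in the settings: SR / f0 * lobeWidth."""
    return sample_rate / f0 * lobe_width


def next_power_of_two(n):
    """Smallest power of two >= n, a common choice for FFT size."""
    return 1 << (math.ceil(n) - 1).bit_length()


def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spec):
    n = len(spec)
    return [(sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]


def harmonic_plus_residual(frame, sr, f0, max_harmonics=30, half_width=1):
    """Split one frame into a harmonic part and a residual part."""
    n = len(frame)
    spec = dft(frame)
    harmonic_bins = set()
    for h in range(1, max_harmonics + 1):
        centre = round(h * f0 * n / sr)
        if centre >= n // 2:
            break
        for b in range(centre - half_width, centre + half_width + 1):
            harmonic_bins.add(b % n)
            harmonic_bins.add((-b) % n)  # mirrored negative-frequency bin
    harm_spec = [v if k in harmonic_bins else 0j for k, v in enumerate(spec)]
    # Subtract the harmonic model from the original spectrum to get the residual.
    res_spec = [s - h for s, h in zip(spec, harm_spec)]
    return idft(harm_spec), idft(res_spec)
```

With the settings listed above, analysis_window_size(44100, 200, 8) gives 1764, and next_power_of_two(1800) gives the FFT size of 2048.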

feature detection

time dependent
Low level features: harmonicity, amplitude, fundamental frequency
high level features: mood, genre, danceability

Acoustic Brainz: (typical analysis page) https://reactivemusic.net/?p=17641

Essentia (open source feature detection tools) https://github.com/MTG/essentia

Freesound (vast library of sounds): https://www.freesound.org – look at “similar sounds”

Removing voices from music

A sad thought

phase cancellation encryption

This method was used to send secret messages during World War II. It’s now used in cell phones to get rid of echo. It’s also used in noise-canceling headphones.

https://reactivemusic.net/?p=8879

max-projects/phase-cancellation/phase-cancellation-example.maxpat
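
The idea behind phase cancellation fits in a few lines: a signal added to its inverted copy sums to silence, so a message buried in noise can be recovered by anyone holding an identical copy of the noise. A minimal sketch, with hypothetical function names and seeded random noise standing in for the shared noise recording:

```python
import random


def invert(signal):
    """Flip the phase (polarity) of every sample."""
    return [-x for x in signal]


def mix(a, b):
    """Sum two signals sample by sample."""
    return [x + y for x, y in zip(a, b)]


def make_noise(n, seed):
    """Both ends need the same noise; here a shared seed plays that role."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


def mask(message, noise):
    """Sender: bury the message in noise."""
    return mix(message, noise)


def unmask(masked, noise):
    """Receiver: add the inverted noise so it cancels, leaving the message."""
    return mix(masked, invert(noise))
```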

Center channel subtraction

What is not left and not right?

Ableton Live – utility/difference device: https://reactivemusic.net/?p=1498 (Alison Krauss example)

Local file: Ableton-teaching-examples/vocal-eliminator
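
Center channel subtraction is the same cancellation trick applied to stereo: anything mixed equally into both channels (often the lead vocal) disappears when one channel is subtracted from the other. A minimal sketch with a hypothetical function name:

```python
def remove_center(left, right):
    """Subtract right from left. Sound panned dead center cancels out."""
    return [l - r for l, r in zip(left, right)]
```

The result is mono, and anything else panned to the center (bass, kick drum) is removed along with the voice.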

More experiments

Synthesizing laughter
Bobby McFerrin: (pentatonic scale) http://www.ted.com/talks/bobby_mcferrin_hacks_your_brain_with_music.html
Alphabet vocals
- jii lighter https://reactivemusic.net/?p=6970
- Sesame St – Joan La Barbara: http://www.youtube.com/watch?v=y819U6jBDog
Warping acapella tracks https://reactivemusic.net/?p=18046

Questions

Why do most people not like the recorded sound of their voice?
Can voice be used as a controller?
- (Imitone: http://imitone.com)
- Mari Kimura
How do you recognize voices?
Does speech recognition work with singing?
How does the Google Speech API know the difference between music and speech?
How can we listen to ultrasonic animal sounds?
What about animal translators?

March 25, 2014 (updated June 17, 2014)

Speech Synthesis Programming Guide

From Apple

https://developer.apple.com/library/mac/documentation/userexperience/conceptual/SpeechSynthesisProgrammingGuide/Introduction/Introduction.html

February 4, 2014 (updated June 25, 2014)

Notes: Chatbots in Conversation

update 6/2014 – Now part of the Internet sensors projects: https://reactivemusic.net/?p=5859

original post

They can talk with each other… sort of.

Last spring I made a project that lets you talk with chatbots using speech recognition and synthesis. https://reactivemusic.net/?p=4710.

Yesterday I managed to get two instances of this program, running on two computers, using two chatbots, to talk with each other, through the air. Technical issues remain (see below). But there were moments of real interaction.

In the original project, a human pressed a button in Max to start and stop recording speech. This has been automated. The program detects and records speech, using audio level sensing. The auto-recording sensor turns on a switch when the level hits a threshold, and turns off after a period of silence. Threshold level and duration of silence can be adjusted by the user. There is also a feedback gate that shuts off auto-record while the computer is converting speech to text, and ‘speaking’ a reply.
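
The auto-record logic can be sketched outside of Max as a small state machine over a list of level readings. This is an illustration of the idea, not a translation of the patch: the function name, the frame-based timing, and the default values are assumptions.

```python
def detect_speech(levels, threshold=0.1, max_silence=3, muted=None):
    """Return (start, end) frame ranges where recording would be switched on.

    levels      -- one audio level reading per frame
    threshold   -- level that switches recording on
    max_silence -- consecutive quiet frames that switch recording off
    muted       -- optional per-frame flags for the feedback gate
    """
    segments = []
    start = None  # frame where the current recording began
    quiet = 0     # consecutive quiet frames while recording
    for i, level in enumerate(levels):
        if muted is not None and muted[i]:
            # Feedback gate: ignore input while the computer is busy or speaking.
            start, quiet = None, 0
            continue
        if start is None:
            if level >= threshold:
                start, quiet = i, 0
        elif level >= threshold:
            quiet = 0
        else:
            quiet += 1
            if quiet >= max_silence:
                # Close the segment where the silence began.
                segments.append((start, i - quiet + 1))
                start = None
    if start is not None:
        segments.append((start, len(levels) - quiet))
    return segments
```

Short pauses inside a sentence do not end the recording. Only a run of quiet frames as long as max_silence does.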

technical issues

The Google speech API has difficulty with some of the voices used by the Mac OS speech synthesizer. We’ll need to experiment to find which voices produce accurate results.
The overall level produced by the built-in MacBook speakers is not quite enough to achieve clear communication. The auto-recorder sometimes missed the onset of speech. One solution would be to insert a click to trigger the recorder just before the speech synthesizer begins the actual speech. Another would be to use external speakers or a secondary “wired” connection.
It would be nice to have menus of chatbots and voices. Also to automate the start of a new conversation thread.
The button to start the audio detector had to be operated by key-press because pushing the trackpad on a MacBook makes too much noise and always triggers the audio level detector.
Occasionally a chatbot would deliver a long response, or one containing a web address. These were problematic for recognition and synthesis.

local files

tkzic/internetsensors/speech-to-google-text-api3.maxpat
tkzic/internetsensors/pandorabots-api2.maxpat
tkzic/internetsensors/text-to-speech3.maxpat

January 9, 2014 (updated June 16, 2014)

Max speech synthesizer

by Jordan Smith at McGill University

http://www.music.mcgill.ca/~jordan/coursework/mumt307/speech_synth.html

January 7, 2013 (updated June 24, 2014)

Conversation with a robot in Max

This project brings together several examples of API programming with Max. The pandorabots.api patch contains an example of using curl to generate an XML response file, then converts XML to JSON using a Python script. The resulting JSON file is read into Max and parsed using the [js] object.
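
The XML-to-JSON step can be sketched with the Python standard library. This is not the xml2json script used in the project, and the sample response in the usage note is a made-up stand-in for whatever the chatbot API actually returns.

```python
import json
import xml.etree.ElementTree as ET


def xml_to_json(xml_text):
    """Convert a small XML document into a JSON string.

    Attributes and child elements become keys. Repeated child tags
    overwrite each other, which is fine for simple one-reply responses.
    """
    def convert(node):
        data = dict(node.attrib)
        for child in node:
            data[child.tag] = convert(child)
        text = (node.text or "").strip()
        if not data:
            return text          # leaf element: just its text
        if text:
            data["text"] = text  # element with both text and children/attributes
        return data

    root = ET.fromstring(xml_text)
    return json.dumps({root.tag: convert(root)})
```

For example, a hypothetical response like <result custid="abc123"><input>hello</input><that>Hi there!</that></result> becomes a JSON object with the reply under result.that, ready to be parsed in Max.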

Here is an audio recording of my conversation (using Max) with a text chatbot named ‘Chomsky’

‘Chomsky’ lives at http://pandorabots.com.

My voice gets recorded by Max then converted to text by the Google speech-api.

The text is passed to the Pandorabots API. The chatbot response gets spoken by the aka.speech external which uses the Mac OS built-in text-to-speech system.

Note: The above recording was processed with a ‘silence truncate’ effect because there were 3-5 second delays between responses. In realtime it has the feel of the Houston/Apollo dialogs.

pandorabots-api.maxpat (which handles chatbot responses) gets text input from speech-to-google-text-api2.maxpat – a patch that converts speech to text using the Google speech-API.

https://reactivemusic.net/?p=4690

The output (responses from the chatbot) gets sent to twitter-search-to-speech2.maxpat which “speaks” using the Mac OS text-to-speech program using the aka.speech external.

files

Max

speech-to-google-text-api2.maxpat
JSON-google-speech.js
pandorabots-api.maxpat
JSON-pandorabot.js
text-to-speech2.maxpat

externals:

[aka.speech] and [aka.shell] from http://www.iamas.ac.jp/~aka/max/ – download this external and add the folder to Options | File Preferences, in Max

[authorization]

none required

external programs:

sox: the sox audio conversion program must be in the computer’s executable file path, i.e., /usr/bin – or you can rewrite the [sprintf] input to [aka.shell] with the actual path. Get sox from: http://sox.sourceforge.net
xml2json (python) in tkzic/internetsensors/: xml2json/xml2json.py and xml2json/setup.py (for translating XML to JSON) – [NOTE] you will need to change the path in the [sprintf] object in pandorabots.api to point to the folder containing this python script.

instructions

Open the three Max patches.
- speech-to-google-text-api2.maxpat
- pandorabots-api.maxpat
- text-to-speech2.maxpat
Clear the custid in the pandorabots-api patch
Start audio in the Google speech patch. Then toggle the mic button and say something.
After the first response, go to the pandorabots-api patch and click the new custid – so that the chatbot retains the thread of the conversation.

download:

The files for this project can be downloaded from the internet-sensors archive on GitHub

https://github.com/tkzic/internet-sensors

January 6, 2013 (updated January 21, 2024)

Speech to text in Max

Using the Google speech API

(updated locally 1/21/2024 – changed binary path to sox for homebrew /opt/homebrew/bin/sox in [p call-google-speech])

Also changed some of the UI and logic for manual writing and sending.

(updated 1/21/2021)

This project demonstrates the Google speech-API. It records speech in Max, processes it using the Google API, and displays the result in a Max [message] object.

download

https://github.com/tkzic/internet-sensors

folder: google-speech

files

main patch

speech-to-google-text-api6.maxpat

abstractions and other files

JSON-google-speech.js (parses JSON response from Google API)
ms-counter.maxpat (manages audio recording buffer)

external Max objects

[shell] from https://github.com/jeremybernstein/shell/releases/tag/1.0b2 – download this external and add the folder to Options | File Preferences, in Max

external programs

sox: the sox audio conversion program must be in the computer’s executable file path, i.e., /usr/bin – or you can rewrite the [sprintf] input to [aka.shell] with the actual path. In our case we installed sox using MacPorts. The executable path is /opt/local/bin/sox – which is built into a message object in the subpatcher [call-google-speech]

get sox from: http://sox.sourceforge.net

note: this conversion may not be necessary with recent updates to Max and the Google speech API
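
Since the sox path differs between installs, here is a small sketch for locating it before hard-coding a path into the patch. The function name is hypothetical, and the extra directories are simply the ones mentioned in this post.

```python
import os
import shutil

# Directories mentioned in this post for various sox installs.
CANDIDATE_DIRS = ["/opt/homebrew/bin", "/opt/local/bin", "/usr/local/opt/bin", "/usr/bin"]


def find_program(name, extra_dirs=CANDIDATE_DIRS):
    """Return the full path to an executable, or None if it cannot be found."""
    found = shutil.which(name)  # search the normal executable path first
    if found:
        return found
    for directory in extra_dirs:
        path = os.path.join(directory, name)
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    return None
```

Calling find_program("sox") returns the path to paste into the message object, or None if sox still needs to be installed.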

authorization

none required – so far

This may be changing.

Insert here: how to get a speech-api key from Google

instructions

Open Max patch: speech-to-google-text-api6
Turn on audio
Press the spacebar. Start talking. Press the spacebar again when you are finished. The translation will begin automatically.

Note: If you have a slow internet connection you may need to tweak the various delay times in the [call-google-speech] subpatch.

send Tweets using speech

Max [send] and [receive] objects pass data from this project to other projects that send Tweets from Max. Just run the patches at the same time.

Using curl: https://reactivemusic.net/?p=5447
Using ruby: https://reactivemusic.net/?p=5818

Also, check out how this project is integrated into the Pandorabots chatbot API project

https://reactivemusic.net/?p=9834

Or anything else. The Google translation is amazingly accurate.

revision history

4/24/2016: An explicit path to sox is needed in the call-google-speech subpatch. In my MacPorts version the path is /usr/local/opt/bin/sox.
5/11/2014: The newest version requires Max 6.1.7 (for JSON parsing). Also updated to Google Speech API v2.
3/26/2014: Updated to use the auto-record features developed for chatbot conversations.

December 11, 2012 (updated July 1, 2014)

Speech recognition in Max

(update 6/2014): it’s easier to use the Google speech-api by calling it from curl. See recent examples at: https://reactivemusic.net/?p=4690

original post:

from Luke Hall in the c74 forum:

http://cycling74.com/forums/topic.php?id=18403

I’ve used Macspeech Dictate in this way. In fact it uses the same speech recognition engine as Dragon Naturally Speaking, it works very well but you could potentially run into the same problems as CJ described above.

Another way to achieve this on a mac is using the built in voice recognition and applescripts and extra suites, which is an applescript extension that extends the range of what you can do, including letting you send key presses.

1. Turn on “speakable items” from system preferences > speech > speech recognition.
2. Open max.
3. Open script editor and write a script like this:

tell application "MaxMSP" to activate
tell application "Extra Suites"
    ES type key "1"
end tell

4. Save it in library > speech > speakable items > application speakable items > maxmsp and name the file whatever you want the voice command to be, for example “press one”
5. Now on the floating speech icon click the down arrow at the bottom and “open speech commands window”. With max as the front-most application check that the commands you just saved as applescripts have appeared in the maxmsp folder.
6. Now simply hook up a [key] object in max, press “escape” (or whichever key you have set up to turn speech recognition on) and say “press one” and you should have [key] spit out “49”!

Sorry about the lengthy explanation. I hope it makes sense to you and gives you another possible (and cheaper!) method of obtaining your goals.

Oh and the applescript extension can be downloaded from: http://www.kanzu.com/