Supporting text-to-speech (and other text handling/processing workflows)


Supporting text-to-speech (and other text handling/processing workflows)

Reece Dunn
Hi,

I have been looking at creating a text-to-speech engine and supporting
GUI. In theory, these can fit nicely into GStreamer, as they take text
and convert it to audio (which you can then plug into a GStreamer
backend for playback or recording). There are some aspects of
text-to-speech (event notifications; data view; text-to-text
workflows) that I am not sure fit directly into the GStreamer model.

Anyway, here are my current thoughts on the architecture of a
text-to-speech engine (without going into the details of how
text-to-phoneme and phoneme-to-text conversion are handled).

----- 8< -----

# Data Sources: file; string buffer; stdin
# Data Sinks: file; string buffer; stdout
# Readers: source => stream
# Writers: stream => sink
# Archives/Compression (Readers): zip; flate; gzip; ...

   1. Archive Offset -- position of the first byte in the specified
file in the archive
   2. File Name -- name of the current file source
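The source => stream reader composition above can be sketched as chained stages. This is a minimal illustrative sketch in Python, not a GStreamer API; all the function names here are hypothetical:

```python
import gzip
import io

def file_source(path):
    """Data source: open a file as a raw byte stream."""
    return open(path, "rb")

def string_source(text, encoding="utf-8"):
    """Data source: a string buffer exposed as a byte stream."""
    return io.BytesIO(text.encode(encoding))

def gzip_reader(stream):
    """Archive/compression stage: wrap a byte stream, yielding decompressed bytes."""
    return gzip.GzipFile(fileobj=stream)

def encoding_reader(stream, encoding="utf-8"):
    """Encoding stage: decode a byte stream into characters."""
    return io.TextIOWrapper(stream, encoding=encoding)

# Compose: Data Source => Archive/Compression => Encoding
payload = gzip.compress("hello world".encode("utf-8"))
text = encoding_reader(gzip_reader(io.BytesIO(payload))).read()
```

Each stage only needs the stream interface of the previous one, which is what makes the source/archive/encoding stages freely composable.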

# Encodings (Readers/Writers): ascii; utf8; ...

   1. Raw Byte Offset -- position in the stream in bytes
   2. Encoded Character Offset -- position in the stream in characters
   3. Need to change encodings -- e.g. xml encoding attribute (ascii
=> utf8; ...) and html meta/content-type tag
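The byte-offset/character-offset distinction above matters as soon as a multi-byte encoding is involved. A toy sketch (illustrative only) of tracking both while decoding UTF-8:

```python
def decode_with_offsets(data, encoding="utf-8"):
    """Yield (byte_offset, char_offset, char) for each decoded character."""
    byte_off = 0
    for char_off, ch in enumerate(data.decode(encoding)):
        yield byte_off, char_off, ch
        byte_off += len(ch.encode(encoding))

# 'a' is 1 byte in UTF-8 but 'é' is 2, so the two offsets diverge after it.
positions = list(decode_with_offsets("aé".encode("utf-8")))
```

This is also why the "what to do when changing encodings?" question below is real: a stored byte offset becomes meaningless once the stream is re-encoded, while a character offset survives.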

# File Formats (Readers): text; html; pdf; epub; odf; rtf; ssml; smil; ...

   1. Stream Offset -- byte/character offset in the raw data stream
(what to do when changing encodings?)
   2. Text Offset -- character offset in the text
   3. Viewer -- presenting the file in a text reader (Gtk+; Qt; ncurses; ...)
   4. File formats may change data source (zipped stream; multi-file
format; ...)
   5. File Reader: Data Source => Archive/Compression => Encoding => File Format
   6. Some formats (e.g. SSML) require understanding phoneme sets:
need to pass this as a phoneme stream
   7. Need a meta-format to transform the source to:
         1. text sequence -- offset/file information; language (may be
different languages; pass xml:lang data; ...); text
         2. phoneme sequence -- offset/file information; phoneme set; prosody
         3. additional instructions -- pauses; volume; rate; pitch; ...
         4. audio files/data? -- e.g. from ssml or smil data
   8. Should support reading/writing the wire format from the File
Format Reader/Writer
         1. format identification
         2. versioning
         3. byte order? -- for binary data (audio; anything else?)
         4. meta-data? -- RDF/Turtle?
         5. encoding? -- text; phoneme sequences; audio data
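The meta-format in points 7-8 could be modelled as a stream of tagged records. A sketch of what the three record kinds might look like (field names are illustrative, not a defined wire format):

```python
from dataclasses import dataclass

@dataclass
class TextEvent:
    text: str
    language: str      # e.g. carried over from an xml:lang attribute
    offset: int        # character offset into the source text
    source: str = ""   # originating file, for multi-file formats

@dataclass
class PhonemeEvent:
    phonemes: str
    phoneme_set: str   # e.g. "ipa", "kirshenbaum"
    offset: int

@dataclass
class ControlEvent:
    instruction: str   # "pause", "volume", "rate", "pitch", ...
    value: float

# A file reader would emit an interleaved stream of such records:
stream = [
    TextEvent("Hello", language="en", offset=0),
    ControlEvent("pause", 0.2),
    PhonemeEvent("h@'loU", phoneme_set="sampa", offset=0),
]
```

Downstream stages (text-to-phoneme, phoneme-to-audio, viewers) can then dispatch on record type without knowing which file format produced the stream.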

# Phoneme Sets (Readers/Writers): ipa; sampa; kirshenbaum; cmu-en_US;
festival-en_US; cepstral-[language]; ...

   1. IPA is a Unicode phoneme set -- U32 data stream
   2. The other phoneme sets use ascii characters only -- U8 data stream
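The U32/U8 split can be checked directly: IPA symbols are Unicode code points outside ASCII, while the ASCII-based sets fit in a byte stream. A quick sketch (the ASCII transcription below is hypothetical, purely for illustration):

```python
ipa = "ɒ"              # an IPA vowel symbol, U+0252
ascii_phonemes = "A."  # hypothetical ASCII-set transcription

# IPA needs code points beyond 127, hence the U32 stream.
ipa_codepoints = [ord(c) for c in ipa]
assert max(ipa_codepoints) > 127

# The ASCII-based sets fit entirely in a U8 stream.
assert all(ord(c) < 128 for c in ascii_phonemes)
```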

# Workflows:

   1. File Reader => Text => Encoding => Data Sink
         1. Test a file reader (e.g. is it handling SSML data correctly).
   2. File Reader => Text => [Text-to-Phoneme] => Phonemes => Phoneme
Set => Encoding => Data Sink
         1. Record the phoneme sequence to a file.
         2. Useful for testing language rules.
         3. dictionary -- use a dictionary to look up words to give
the phoneme (and possibly parts-of-speech) sequence
         4. letter-to-phoneme -- use letter-to-phoneme rules where
there is no dictionary match.
         5. accent/dialect -- apply accent/dialect phoneme-to-phoneme
transformation rules (e.g. /ɒ/ => /ɑ/ (cot-caught merger) in General
American).
         6. target phoneme set -- the phoneme set being written
(default=ipa+utf8)
         7. encoding -- the target encoding for the phoneme set to be
written out as (ascii; utf8; ...)
   3. Data Source => Encoding => Phoneme Set => Phonemes => Phoneme
Set => Encoding => Data Sink
         1. Phoneme set transcoding (e.g. Unicode IPA to Kirshenbaum).
         2. Useful for testing phoneme set support.
         3. source phoneme set -- the phoneme set being read (encode
in file stream? -- better than asking the user to know this)
         4. target phoneme set -- the phoneme set being written
(default=ipa+utf8)
         5. encoding -- the target encoding for the phoneme set to be
written out as (ascii; utf8; ...)
   4. File Reader => Text => [Text-to-Phoneme] => Phonemes =>
[Phoneme-to-audio] => Raw Audio => GStreamer
         1. Playback to an audio sink (alsa; oss; pulseaudio; jack;
portaudio; ...).
         2. Record to a file (raw pcm; wav; ogg; flac; ...).
         3. Hook into compatible media players (totem; ...).
         4. How to handle text-to-speech events (e.g. for highlighting
the current word being spoken; for playback progress; ...)?
   5. Other combinations/workflows are possible.
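The dictionary-lookup, letter-to-phoneme-fallback and accent-rule steps of workflow 2 can be sketched as below. The dictionary entries and letter rules here are toy data, not a real lexicon; a real engine would use large pronunciation dictionaries and context-sensitive rule sets:

```python
DICTIONARY = {"cat": "kæt"}                     # word -> IPA, illustrative
LETTER_RULES = {"d": "d", "o": "ɒ", "g": "ɡ"}   # toy letter-to-phoneme rules

def text_to_phonemes(word):
    """Dictionary lookup first; fall back to letter-to-phoneme rules."""
    if word in DICTIONARY:
        return DICTIONARY[word]
    return "".join(LETTER_RULES.get(ch, ch) for ch in word)

def apply_accent(phonemes, rules):
    """Accent/dialect phoneme-to-phoneme transformation."""
    for src, dst in rules.items():
        phonemes = phonemes.replace(src, dst)
    return phonemes

# e.g. the /ɒ/ => /ɑ/ rule mentioned for General American above:
general_american = {"ɒ": "ɑ"}
result = apply_accent(text_to_phonemes("dog"), general_american)  # "dɑɡ"
```

Keeping the accent rules as a separate phoneme-to-phoneme stage is what lets one set of language rules serve several accents/dialects.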

----- >8 -----

Some of this (character encodings, text-based file format readers,
etc.) is shared with other text/document viewers (okular, firefox,
chromium, ...), while other bits are shared with media players
(specifically the audio back end).

There are also other text-to-speech engines (eSpeak, festival,
Cepstral, ...) that support file in (text, ssml, ...) and audio out
for the 'Text => [Text-to-Phoneme] => Phonemes => [Phoneme-to-audio]
=> Raw Audio' part of the processing chain.

In addition to this, the system above is suited to text file
conversion workflows (e.g. pdf => text, odf => rdf, ...).

This could also be useful for accessibility APIs that make use of
text-to-speech (in gnome, kde and others).

So... can this be supported in GStreamer?

If so, how? (My investigation didn't find any useful documentation on
writing your own sources and sinks, or on alternative pipeline models.)
Can it support callbacks/events (e.g. for highlighting words as they
are read)?

- Reece

_______________________________________________
gstreamer-devel mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gstreamer-devel

Re: Supporting text-to-speech (and other text handling/processing workflows)

Stefan Sauer
Reece Dunn wrote:

> Hi,
>
> I have been looking at creating a text-to-speech engine and supporting
> GUI. In theory, these can fit nicely into GStreamer, as they take text
> and convert it to audio (which you can then plug into a GStreamer
> backend for playback or recording). There are some aspects of
> text-to-speech (event notifications; data view; text-to-text
> workflows) that I am not sure fit directly into the GStreamer model.
>
> Anyway, here are my current thoughts on the architecture of a
> text-to-speech engine (without going into the details of how
> text-to-phoneme and phoneme-to-text is handled).

just go ahead and do it :)

text-to-speech : festival (ftlite would be nice)
speech-to-text : pocketsphinx

Those are not perfect, but a good starting point. Now please write a
google-translate plugin with src-language and target-language parameters, use
sentence events from pocketsphinx to kick off translations of the text via the
Google web service, and voilà -- we have the Star Trek universal translator.


But seriously, all that should generally work. You might also want to look at
the subtitle code, which handles sparse text streams.

Stefan
