Supporting text-to-speech (and other text handling/processing workflows)


Supporting text-to-speech (and other text handling/processing workflows)

Reece Dunn
Hi,

I have been looking at creating a text-to-speech engine and supporting
GUI. In theory, these can fit nicely into GStreamer, as they take text
and convert it to audio (which you can then plug into a GStreamer
backend for playback or recording). There are some aspects of
text-to-speech (event notifications; data view; text-to-text
workflows) that I am not sure fit directly into the GStreamer model.

Anyway, here are my current thoughts on the architecture of a
text-to-speech engine (without going into the details of how
text-to-phoneme and phoneme-to-text conversion are handled).

----- 8< -----

# Data Sources: file; string buffer; stdin
# Data Sinks: file; string buffer; stdout
# Readers: source => stream
# Writers: stream => sink
# Archives/Compression (Readers): zip; flate; gzip; ...

   1. Archive Offset -- position of the first byte in the specified
file in the archive
   2. File Name -- name of the current file source
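The source => stream reader composition above can be sketched as chained stages. This is a minimal illustrative sketch in Python, not a GStreamer API; all the function names here are hypothetical:

```python
import gzip
import io

def file_source(path):
    """Data source: open a file as a raw byte stream."""
    return open(path, "rb")

def string_source(text, encoding="utf-8"):
    """Data source: a string buffer exposed as a byte stream."""
    return io.BytesIO(text.encode(encoding))

def gzip_reader(stream):
    """Archive/compression stage: wrap a byte stream, yielding decompressed bytes."""
    return gzip.GzipFile(fileobj=stream)

def encoding_reader(stream, encoding="utf-8"):
    """Encoding stage: decode a byte stream into characters."""
    return io.TextIOWrapper(stream, encoding=encoding)

# Compose: Data Source => Archive/Compression => Encoding
payload = gzip.compress("hello world".encode("utf-8"))
text = encoding_reader(gzip_reader(io.BytesIO(payload))).read()
```

Each stage only needs the stream interface of the previous one, which is what makes the source/archive/encoding stages freely composable.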

# Encodings (Readers/Writers): ascii; utf8; ...

   1. Raw Byte Offset -- position in the stream in bytes
   2. Encoded Character Offset -- position in the stream in characters
   3. Need to change encodings -- e.g. xml encoding attribute (ascii
=> utf8; ...) and html meta/content-type tag
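The byte-offset/character-offset distinction above matters as soon as a multi-byte encoding is involved. A toy sketch (illustrative only) of tracking both while decoding UTF-8:

```python
def decode_with_offsets(data, encoding="utf-8"):
    """Yield (byte_offset, char_offset, char) for each decoded character."""
    byte_off = 0
    for char_off, ch in enumerate(data.decode(encoding)):
        yield byte_off, char_off, ch
        byte_off += len(ch.encode(encoding))

# 'a' is 1 byte in UTF-8 but 'é' is 2, so the two offsets diverge after it.
positions = list(decode_with_offsets("aé".encode("utf-8")))
```

This is also why the "what to do when changing encodings?" question below is real: a stored byte offset becomes meaningless once the stream is re-encoded, while a character offset survives.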

# File Formats (Readers): text; html; pdf; epub; odf; rtf; ssml; smil; ...

   1. Stream Offset -- byte/character offset in the raw data stream
(what to do when changing encodings?)
   2. Text Offset -- character offset in the text
   3. Viewer -- presenting the file in a text reader (Gtk+; Qt; ncurses; ...)
   4. File formats may change data source (zipped stream; multi-file
format; ...)
   5. File Reader: Data Source => Archive/Compression => Encoding => File Format
   6. Some formats (e.g. SSML) require understanding phoneme sets:
need to pass this as a phoneme stream
   7. Need a meta-format to transform the source to:
         1. text sequence -- offset/file information; language (may be
different languages; pass xml:lang data; ...); text
         2. phoneme sequence -- offset/file information; phoneme set; prosody
         3. additional instructions -- pauses; volume; rate; pitch; ...
         4. audio files/data? -- e.g. from ssml or smil data
   8. Should support reading/writing the wire format from the File
Format Reader/Writer
         1. format identification
         2. versioning
         3. byte order? -- for binary data (audio; anything else?)
         4. meta-data? -- RDF/Turtle?
         5. encoding? -- text; phoneme sequences; audio data
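The meta-format in points 7-8 could be modelled as a stream of tagged records. A sketch of what the three record kinds might look like (field names are illustrative, not a defined wire format):

```python
from dataclasses import dataclass

@dataclass
class TextEvent:
    text: str
    language: str      # e.g. carried over from an xml:lang attribute
    offset: int        # character offset into the source text
    source: str = ""   # originating file, for multi-file formats

@dataclass
class PhonemeEvent:
    phonemes: str
    phoneme_set: str   # e.g. "ipa", "kirshenbaum"
    offset: int

@dataclass
class ControlEvent:
    instruction: str   # "pause", "volume", "rate", "pitch", ...
    value: float

# A file reader would emit an interleaved stream of such records:
stream = [
    TextEvent("Hello", language="en", offset=0),
    ControlEvent("pause", 0.2),
    PhonemeEvent("h@'loU", phoneme_set="sampa", offset=0),
]
```

Downstream stages (text-to-phoneme, phoneme-to-audio, viewers) can then dispatch on record type without knowing which file format produced the stream.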

# Phoneme Sets (Readers/Writers): ipa; sampa; kirshenbaum; cmu-en_US;
festival-en_US; cepstral-[language]; ...

   1. IPA is a Unicode phoneme set -- U32 data stream
   2. The other phoneme sets use ascii characters only -- U8 data stream
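The U32/U8 split can be checked directly: IPA symbols are Unicode code points outside ASCII, while the ASCII-based sets fit in a byte stream. A quick sketch (the ASCII transcription below is hypothetical, purely for illustration):

```python
ipa = "ɒ"              # an IPA vowel symbol, U+0252
ascii_phonemes = "A."  # hypothetical ASCII-set transcription

# IPA needs code points beyond 127, hence the U32 stream.
ipa_codepoints = [ord(c) for c in ipa]
assert max(ipa_codepoints) > 127

# The ASCII-based sets fit entirely in a U8 stream.
assert all(ord(c) < 128 for c in ascii_phonemes)
```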

# Workflows:

   1. File Reader => Text => Encoding => Data Sink
         1. Test a file reader (e.g. is it handling SSML data correctly).
   2. File Reader => Text => [Text-to-Phoneme] => Phonemes => Phoneme
Set => Encoding => Data Sink
         1. Record the phoneme sequence to a file.
         2. Useful for testing language rules.
         3. dictionary -- use a dictionary to look up words to give
the phoneme (and possibly parts-of-speech) sequence
         4. letter-to-phoneme -- use letter-to-phoneme rules where
there is no dictionary match.
         5. accent/dialect -- apply accent/dialect phoneme-to-phoneme
transformation rules (e.g. /ɒ/ => /ɑ/ (cot-caught merger) in General
American).
         6. target phoneme set -- the phoneme set being written
(default=ipa+utf8)
         7. encoding -- the target encoding for the phoneme set to be
written out as (ascii; utf8; ...)
   3. Data Source => Encoding => Phoneme Set => Phonemes => Phoneme
Set => Encoding => Data Sink
         1. Phoneme set transcoding (e.g. Unicode IPA to Kirshenbaum).
         2. Useful for testing phoneme set support.
         3. source phoneme set -- the phoneme set being read (encode
in file stream? -- better than asking the user to know this)
         4. target phoneme set -- the phoneme set being written
(default=ipa+utf8)
         5. encoding -- the target encoding for the phoneme set to be
written out as (ascii; utf8; ...)
   4. File Reader => Text => [Text-to-Phoneme] => Phonemes =>
[Phoneme-to-audio] => Raw Audio => GStreamer
         1. Playback to an audio sink (alsa; oss; pulseaudio; jack;
portaudio; ...).
         2. Record to a file (raw pcm; wav; ogg; flac; ...).
         3. Hook into compatible media players (totem; ...).
         4. How to handle text-to-speech events (e.g. for highlighting
the current word being spoken; for playback progress; ...)?
   5. Other combinations/workflows are possible.
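The dictionary-lookup, letter-to-phoneme-fallback and accent-rule steps of workflow 2 can be sketched as below. The dictionary entries and letter rules here are toy data, not a real lexicon; a real engine would use large pronunciation dictionaries and context-sensitive rule sets:

```python
DICTIONARY = {"cat": "kæt"}                     # word -> IPA, illustrative
LETTER_RULES = {"d": "d", "o": "ɒ", "g": "ɡ"}   # toy letter-to-phoneme rules

def text_to_phonemes(word):
    """Dictionary lookup first; fall back to letter-to-phoneme rules."""
    if word in DICTIONARY:
        return DICTIONARY[word]
    return "".join(LETTER_RULES.get(ch, ch) for ch in word)

def apply_accent(phonemes, rules):
    """Accent/dialect phoneme-to-phoneme transformation."""
    for src, dst in rules.items():
        phonemes = phonemes.replace(src, dst)
    return phonemes

# e.g. the /ɒ/ => /ɑ/ rule mentioned for General American above:
general_american = {"ɒ": "ɑ"}
result = apply_accent(text_to_phonemes("dog"), general_american)  # "dɑɡ"
```

Keeping the accent rules as a separate phoneme-to-phoneme stage is what lets one set of language rules serve several accents/dialects.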

----- >8 -----

Some of this (character encodings, text-based file format readers,
etc.) is shared with other text/document viewers (okular, firefox,
chromium, ...), while other bits are shared with media players
(specifically the audio back end).

There are also other text-to-speech engines (eSpeak, festival,
Cepstral, ...) that support file in (text, ssml, ...) and audio out
for the 'Text => [Text-to-Phoneme] => Phonemes => [Phoneme-to-audio]
=> Raw Audio' part of the processing chain.

In addition to this, the system above is suited to text file
conversion workflows (e.g. pdf => text, odf => rdf, ...).

This could also be useful for accessibility APIs that make use of
text-to-speech (in gnome, kde and others).

So... can this be supported in GStreamer?

If so, how? (My investigation didn't find any useful documentation on
writing your own sources and sinks, or on alternative pipeline models.)
Can it support callbacks/events (e.g. for highlighting words as they
are read)?

- Reece

_______________________________________________
gstreamer-devel mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gstreamer-devel

Re: Supporting text-to-speech (and other text handling/processing workflows)

Stefan Sauer
Reece Dunn wrote:

> Hi,
>
> I have been looking at creating a text-to-speech engine and supporting
> GUI. In theory, these can fit nicely into GStreamer, as they take text
> and convert it to audio (which you can then plug into a GStreamer
> backend for playback or recording). There are some aspects of
> text-to-speech (event notifications; data view; text-to-text
> workflows) that I am not sure fit directly into the GStreamer model.
>
> Anyway, here are my current thoughts on the architecture of a
> text-to-speech engine (without going into the details of how
> text-to-phoneme and phoneme-to-text is handled).

just go ahead and do it :)

text-to-speech : festival (ftlite would be nice)
speech-to-text : pocketsphinx

Those are not perfect, but a good starting point. Now please write a
google-translate plugin with src-language and target-language parameters, use
sentence events from pocketsphinx to kick off translations of the text via the
Google web service, and voilà -- we have the Star Trek universal translator.


But seriously, all that should generally work. You might also want to look at
the subtitle code, which handles sparse text streams.

Stefan
