Hi,
I have been looking at creating a text-to-speech engine and supporting GUI. In theory, these can fit nicely into GStreamer, as they take text and convert it to audio (which you can then plug into a GStreamer backend for playback or recording). There are some aspects of text-to-speech (event notifications; data view; text-to-text workflows) that I am not sure fit directly into the GStreamer model.

Anyway, here are my current thoughts on the architecture of a text-to-speech engine (without going into the details of how the text-to-phoneme and phoneme-to-audio conversions are handled).

----- 8< -----

# Data Sources: file; string buffer; stdin
# Data Sinks: file; string buffer; stdout
# Readers: source => stream
# Writers: stream => sink

# Archives/Compression (Readers): zip; flate; gzip; ...

   1. Archive Offset -- position of the first byte of the specified file in the archive
   2. File Name -- name of the current file source

# Encodings (Readers/Writers): ascii; utf8; ...

   1. Raw Byte Offset -- position in the stream in bytes
   2. Encoded Character Offset -- position in the stream in characters
   3. Need to be able to change encodings -- e.g. the xml encoding attribute (ascii => utf8; ...) and the html meta/content-type tag

# File Formats (Readers): text; html; pdf; epub; odf; rtf; ssml; smil; ...

   1. Stream Offset -- byte/character offset in the raw data stream (what to do when changing encodings?)
   2. Text Offset -- character offset in the text
   3. Viewer -- presenting the file in a text reader (Gtk+; Qt; ncurses; ...)
   4. File formats may change the data source (zipped stream; multi-file format; ...)
   5. File Reader: Data Source => Archive/Compression => Encoding => File Format
   6. Some formats (e.g. SSML) require understanding phoneme sets: these need to be passed on as a phoneme stream
   7. Need a meta-format to transform the source to:
         1. text sequence -- offset/file information; language (may contain different languages; pass xml:lang data; ...); text
         2. phoneme sequence -- offset/file information; phoneme set; prosody
         3. additional instructions -- pauses; volume; rate; pitch; ...
         4. audio files/data? -- e.g. from ssml or smil data
   8. Should support reading/writing the wire format from the File Format Reader/Writer:
         1. format identification
         2. versioning
         3. byte order? -- for binary data (audio; anything else?)
         4. meta-data? -- RDF/Turtle?
         5. encoding? -- text; phoneme sequences; audio data

# Phoneme Sets (Readers/Writers): ipa; sampa; kirshenbaum; cmu-en_US; festival-en_US; cepstral-[language]; ...

   1. IPA is a Unicode phoneme set -- U32 data stream
   2. The other phoneme sets use ascii characters only -- U8 data stream

# Workflows:

   1. File Reader => Text => Encoding => Data Sink
         1. Test a file reader (e.g. is it handling SSML data correctly?).
   2. File Reader => Text => [Text-to-Phoneme] => Phonemes => Phoneme Set => Encoding => Data Sink
         1. Record the phoneme sequence to a file.
         2. Useful for testing language rules.
         3. dictionary -- use a dictionary to look up words, giving the phoneme (and possibly part-of-speech) sequence
         4. letter-to-phoneme -- use letter-to-phoneme rules where there is no dictionary match.
         5. accent/dialect -- apply accent/dialect phoneme-to-phoneme transformation rules (e.g. /ɒ/ => /ɑ/ (the cot-caught merger) in General American).
         6. target phoneme set -- the phoneme set being written (default=ipa+utf8)
         7. encoding -- the target encoding for the phoneme set to be written out as (ascii; utf8; ...)
   3. Data Source => Encoding => Phoneme Set => Phonemes => Phoneme Set => Encoding => Data Sink
         1. Phoneme set transcoding (e.g. Unicode IPA to Kirshenbaum).
         2. Useful for testing phoneme set support.
         3. source phoneme set -- the phoneme set being read (encode it in the file stream? -- better than asking the user to know this)
         4. target phoneme set -- the phoneme set being written (default=ipa+utf8)
         5. encoding -- the target encoding for the phoneme set to be written out as (ascii; utf8; ...)
   4. File Reader => Text => [Text-to-Phoneme] => Phonemes => [Phoneme-to-audio] => Raw Audio => GStreamer
         1. Playback to an audio sink (alsa; oss; pulseaudio; jack; portaudio; ...).
         2. Record to a file (raw pcm; wav; ogg; flac; ...).
         3. Hook into compatible media players (totem; ...).
         4. How to handle text-to-speech events (e.g. for highlighting the current word being spoken; for playback progress; ...)? See the sketch after this message for one way this could look.
   5. Other combinations/workflows are possible.

----- >8 -----

Some of this (character encodings, text-based file format readers, etc.) is shared with other text/document viewers (okular, firefox, chromium, ...), while other bits are shared with media players (specifically the audio back end).

There are also other text-to-speech engines (eSpeak, festival, Cepstral, ...) that support file in (text, ssml, ...) and audio out for the 'Text => [Text-to-Phoneme] => Phonemes => [Phoneme-to-audio] => Raw Audio' part of the processing chain.

In addition to this, the system above is suited to text file conversion workflows (e.g. pdf => text, odf => rdf, ...).

This could also be useful for accessibility APIs that make use of text-to-speech (in gnome, kde and others).

So... can this be supported in GStreamer?

If so, how (my investigation didn't find any useful documentation on writing your own sources/sinks, or on different pipeline models)? Can it support callbacks/events (e.g. for highlighting words being read)?

- Reece
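On the last question (callbacks/events for word highlighting): GStreamer lets an element post application-defined "element" messages on the pipeline's bus, and the GUI can watch the bus for them. The sketch below shows the application side under that assumption; the "tts" element in the pipeline string and the "tts-word" message structure (fields "text-offset" and "word") are hypothetical stand-ins for whatever TTS element or chain ends up being written - only the bus-watching calls themselves are standard GStreamer API.

/* Sketch only: watch the pipeline bus for word-boundary notifications.
 * The "tts" element and the "tts-word" message structure are hypothetical;
 * the element-message/bus mechanism is standard GStreamer. */
#include <gst/gst.h>

static GMainLoop *loop;

static gboolean
bus_cb (GstBus *bus, GstMessage *msg, gpointer user_data)
{
  switch (GST_MESSAGE_TYPE (msg)) {
    case GST_MESSAGE_ELEMENT: {
      const GstStructure *s = gst_message_get_structure (msg);

      if (s && gst_structure_has_name (s, "tts-word")) {
        gint offset = 0;
        const gchar *word = gst_structure_get_string (s, "word");

        gst_structure_get_int (s, "text-offset", &offset);
        /* A GUI would highlight the word at this offset in its text view. */
        g_print ("speaking \"%s\" at offset %d\n", word ? word : "?", offset);
      }
      break;
    }
    case GST_MESSAGE_EOS:
      g_main_loop_quit (loop);
      break;
    default:
      break;
  }
  return TRUE;
}

int
main (int argc, char *argv[])
{
  GstElement *pipeline;
  GstBus *bus;
  GError *error = NULL;

  gst_init (&argc, &argv);

  /* filesrc ! (hypothetical) tts ! audioconvert ! autoaudiosink */
  pipeline = gst_parse_launch (
      "filesrc location=book.txt ! tts ! audioconvert ! autoaudiosink", &error);
  if (pipeline == NULL) {
    g_printerr ("pipeline error: %s\n", error->message);
    return 1;
  }

  bus = gst_pipeline_get_bus (GST_PIPELINE (pipeline));
  gst_bus_add_watch (bus, bus_cb, NULL);
  gst_object_unref (bus);

  gst_element_set_state (pipeline, GST_STATE_PLAYING);
  loop = g_main_loop_new (NULL, FALSE);
  g_main_loop_run (loop);

  gst_element_set_state (pipeline, GST_STATE_NULL);
  gst_object_unref (pipeline);
  return 0;
}

Whether a single element or a chain of text/phoneme elements produces those messages is exactly the design question raised above; the bus mechanism works the same either way.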
Reece Dunn wrote:
> Hi,
>
> I have been looking at creating a text-to-speech engine and supporting
> GUI. In theory, these can fit nicely into GStreamer, as they take text
> and convert it to audio (which you can then plug into a GStreamer
> backend for playback or recording). There are some aspects of
> text-to-speech (event notifications; data view; text-to-text
> workflows) that I am not sure fit directly into the GStreamer model.
>
> Anyway, here are my current thoughts on the architecture of a
> text-to-speech engine (without going into the details of how the
> text-to-phoneme and phoneme-to-audio conversions are handled).

just go ahead and do it :)

text-to-speech : festival (flite would be nice)
speech-to-text : pocketsphinx

Those are not perfect, but a good starting point. Now please write a google-translate plugin with src-language and target-language parameters, use sentence events from pocketsphinx to kick off translations of the text via the Google web service, and voilà - we have the Star Trek universal translator.

But seriously, all that should generally work. You might also want to look at the subtitle handling, which deals with sparse text streams.

Stefan
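To make the events side of "all that should generally work" a bit more concrete: the element half of the word-highlighting sketch above could look roughly like the helper below, called by a (still hypothetical) element wrapping festival/flite or eSpeak whenever it starts synthesising a new word. The "tts-word" structure name and its fields are the same invented ones as in the earlier sketch; posting element messages on the bus is the standard GStreamer mechanism for this kind of out-of-band notification.

/* Sketch of the element side: how a text-to-speech element (e.g. one
 * wrapping festival/flite or eSpeak) could report the word it is
 * currently synthesising.  The "tts-word" name and fields are invented
 * for illustration. */
#include <gst/gst.h>

static void
post_word_boundary (GstElement *element, gint text_offset, const gchar *word)
{
  GstStructure *s;

  s = gst_structure_new ("tts-word",
      "text-offset", G_TYPE_INT, text_offset,
      "word", G_TYPE_STRING, word,
      NULL);

  /* The message travels up the bus; the application's bus watch
   * (see the earlier sketch) receives it and can highlight the word. */
  gst_element_post_message (element,
      gst_message_new_element (GST_OBJECT (element), s));
}

Anything the application needs to react to that is not audio data (word boundaries, playback progress, problems in the source text) can travel this way without disturbing the streaming path itself.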