Groups
-
trunghoang12
Stream ASR (Speech to Text Online)
Stream ASR is a module that provides real-time speech-to-text using NVIDIA's Riva SDK.
It is built on the Riva SDK with the Conformer model and exposes a Socket.IO interface for ease of use.
Workflow
- Step 1: Connect to the host https://agents.monkeyenglish.net (Socket.IO).
- Step 2: For each speech-to-text session, emit the "on_start" event. If "on_start" succeeds, the server responds with:
  {'status': 'Connected to server successful.'}
- Step 3: Once "on_start" succeeds, send audio data to the "audio_stream" event.
- Step 4: Responses to "audio_stream" are received via the "asr_response" event.
- Step 5: To finish the session, emit the "on_end" event. On completion the server returns:
  {'status': 'Stopped to server successful.'}
Code example
- For JS
```javascript
const socket = io('https://agents.monkeyenglish.net');

// Handle connection
socket.on('connect', () => {
  console.log('Connected to server');
});

// Handle disconnection
socket.on('disconnect', () => {
  console.log('Disconnected from server');
});

// Handle ASR response
socket.on('asr_response', (data) => {
  console.log('Received ASR response:', data);
});

// Function to send audio data
function pushAudioStream(audioData) {
  socket.emit('audio_stream', audioData);
}

// Example of reading log file and sending data (implement as needed)
```
- For Python
```python
import socketio
import threading
import time

# Create a Socket.IO client
sio = socketio.Client()

# Event handler for connection
@sio.event
def connect():
    print('Connected to server')

# Event handler for disconnection
@sio.event
def disconnect():
    print('Disconnected from server')

# Event handler for 'asr_response' event
@sio.on('asr_response')
def on_asr_response(data):
    print('Received ASR response:', data)

# Function to push data to 'audio_stream'
def push_audio_stream(audio_data: str):
    sio.emit('audio_stream', audio_data)

# Function to read and push lines from the log file
def stream_log_file(file_path: str):
    lines = []
    with open(file_path, 'r') as file:
        for line in file:
            # Adjust the split logic based on your log format
            lines.append(line.strip().split(" ")[1])
    for line in lines:
        push_audio_stream(line)
        time.sleep(0.1)  # Delay between sending lines

# Function to handle the streaming and listening concurrently
def start_streaming_and_listening():
    # Connect to the Socket.IO server
    sio.connect('https://agents.monkeyenglish.net')

    # Start a separate thread to stream the log file
    log_file_path = 'com.earlystart.monkeytalk-latest.log'
    stream_thread = threading.Thread(target=stream_log_file, args=(log_file_path,))
    stream_thread.start()

    # Keep the main thread alive to listen for responses
    sio.wait()

# Start the process
if __name__ == "__main__":
    start_streaming_and_listening()
```
-
trunghoang12
Remove Background API Documentation
This API endpoint allows users to remove the background from an image.
HTTP Method:
POST
Endpoint
https://agents.monkeyenglish.net/api/v1/images/remove_bg
Headers:
- accept: application/json (specifies the expected response format, JSON)
- APIKEY: a813ec766197294184a938c331b08e7g (a unique API key used for authentication)
- Content-Type: multipart/form-data (required when uploading files)
Parameters:
- image (required):
The image file to be processed. You need to specify the image file using @filename in curl. Make sure to set the correct MIME type for the image.
Example curl Request:
```bash
curl -X 'POST' \
  'https://agents.monkeyenglish.net/api/v1/images/remove_bg' \
  -H 'accept: application/json' \
  -H 'APIKEY: a813ec766197294184a938c331b08e7g' \
  -H 'Content-Type: multipart/form-data' \
  -F 'image=@images.jpeg;type=image/jpeg'
```
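For reference, here is a minimal Python sketch of the same request using the requests library (the local file name is illustrative; the endpoint and headers come from the curl example above):

```python
import requests

URL = "https://agents.monkeyenglish.net/api/v1/images/remove_bg"
HEADERS = {
    "accept": "application/json",
    "APIKEY": "a813ec766197294184a938c331b08e7g",
}

# Upload the image as multipart/form-data; requests sets the
# multipart boundary automatically when `files` is used.
with open("images.jpeg", "rb") as f:
    resp = requests.post(
        URL,
        headers=HEADERS,
        files={"image": ("images.jpeg", f, "image/jpeg")},
    )

print(resp.status_code)
print(resp.json())
```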
-
trunghoang12
Domain: https://agents.monkeyenglish.net/
APIKEY: a813ec766197294184a938c331b08e7g

Translate Text API
Endpoint:
POST /translate
Description:
This endpoint translates text from a source language to a target language. It supports both basic and advanced translation, with additional options for context, area, and style in the advanced mode.
Request Headers:
| Header | Type | Required | Description |
|--------|------|----------|-------------|
| APIKEY | String | Yes | API key for authorization |

Request Body:
Basic Translation:
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| source_lang | String | Yes | The language of the source text. |
| target_lang | String | Yes | The language to translate to. |
| sentence | String | Yes | The text to be translated. |
| is_advance | Boolean | No | Set to False for basic translation. |

Note: source_lang and target_lang accept either a language name or a language code from the table below.
Advanced Translation (with additional optional fields):
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| source_lang | String | Yes | The language of the source text. |
| target_lang | String | Yes | The language to translate to. |
| sentence | String | Yes | The text to be translated. |
| is_advance | Boolean | Yes | Set to True for advanced translation. |
| area | String | No | The domain/area for translation (e.g., legal, medical). |
| style | String | No | The translation style (e.g., formal, informal). |
| context | String | No | Additional context for the translation. |

Example Request (Basic Translation):
{ "source_lang": "en", "target_lang": "es", "sentence": "Hello, how are you?", "is_advance": false }
Example Response
{ "message": "success", "target": "Humanity is truly terrifying.", "audio_target": "", "data": { "vie": { "text": "Nhân loại thực sự đáng sợ.", "audio": "https://vnmedia2.monkeyuni.net/App/uploads/productivity/TW4TBoru1K0TAneo7qUc.wav" }, "eng": { "text": "Humanity is truly terrifying.", "audio": "https://vnmedia2.monkeyuni.net/App/uploads/productivity/h41XWUypOsAOIWUvWvvW.wav" } } }
Supported Languages:
| code | language | script | Source | Target |
|------|----------|--------|--------|--------|
| afr | Afrikaans | Latn | Sp, Tx | Tx |
| amh | Amharic | Ethi | Sp, Tx | Tx |
| arb | Modern Standard Arabic | Arab | Sp, Tx | Sp, Tx |
| ary | Moroccan Arabic | Arab | Sp, Tx | Tx |
| arz | Egyptian Arabic | Arab | Sp, Tx | Tx |
| asm | Assamese | Beng | Sp, Tx | Tx |
| ast | Asturian | Latn | Sp | -- |
| azj | North Azerbaijani | Latn | Sp, Tx | Tx |
| bel | Belarusian | Cyrl | Sp, Tx | Tx |
| ben | Bengali | Beng | Sp, Tx | Sp, Tx |
| bos | Bosnian | Latn | Sp, Tx | Tx |
| bul | Bulgarian | Cyrl | Sp, Tx | Tx |
| cat | Catalan | Latn | Sp, Tx | Sp, Tx |
| ceb | Cebuano | Latn | Sp, Tx | Tx |
| ces | Czech | Latn | Sp, Tx | Sp, Tx |
| ckb | Central Kurdish | Arab | Sp, Tx | Tx |
| cmn | Mandarin Chinese | Hans | Sp, Tx | Sp, Tx |
| cmn_Hant | Mandarin Chinese | Hant | Sp, Tx | Sp, Tx |
| cym | Welsh | Latn | Sp, Tx | Sp, Tx |
| dan | Danish | Latn | Sp, Tx | Sp, Tx |
| deu | German | Latn | Sp, Tx | Sp, Tx |
| ell | Greek | Grek | Sp, Tx | Tx |
| eng | English | Latn | Sp, Tx | Sp, Tx |
| est | Estonian | Latn | Sp, Tx | Sp, Tx |
| eus | Basque | Latn | Sp, Tx | Tx |
| fin | Finnish | Latn | Sp, Tx | Sp, Tx |
| fra | French | Latn | Sp, Tx | Sp, Tx |
| fuv | Nigerian Fulfulde | Latn | Sp, Tx | Tx |
| gaz | West Central Oromo | Latn | Sp, Tx | Tx |
| gle | Irish | Latn | Sp, Tx | Tx |
| glg | Galician | Latn | Sp, Tx | Tx |
| guj | Gujarati | Gujr | Sp, Tx | Tx |
| heb | Hebrew | Hebr | Sp, Tx | Tx |
| hin | Hindi | Deva | Sp, Tx | Sp, Tx |
| hrv | Croatian | Latn | Sp, Tx | Tx |
| hun | Hungarian | Latn | Sp, Tx | Tx |
| hye | Armenian | Armn | Sp, Tx | Tx |
| ibo | Igbo | Latn | Sp, Tx | Tx |
| ind | Indonesian | Latn | Sp, Tx | Sp, Tx |
| isl | Icelandic | Latn | Sp, Tx | Tx |
| ita | Italian | Latn | Sp, Tx | Sp, Tx |
| jav | Javanese | Latn | Sp, Tx | Tx |
| jpn | Japanese | Jpan | Sp, Tx | Sp, Tx |
| kam | Kamba | Latn | Sp | -- |
| kan | Kannada | Knda | Sp, Tx | Tx |
| kat | Georgian | Geor | Sp, Tx | Tx |
| kaz | Kazakh | Cyrl | Sp, Tx | Tx |
| kea | Kabuverdianu | Latn | Sp | -- |
| khk | Halh Mongolian | Cyrl | Sp, Tx | Tx |
| khm | Khmer | Khmr | Sp, Tx | Tx |
| kir | Kyrgyz | Cyrl | Sp, Tx | Tx |
| kor | Korean | Kore | Sp, Tx | Sp, Tx |
| lao | Lao | Laoo | Sp, Tx | Tx |
| lit | Lithuanian | Latn | Sp, Tx | Tx |
| ltz | Luxembourgish | Latn | Sp | -- |
| lug | Ganda | Latn | Sp, Tx | Tx |
| luo | Luo | Latn | Sp, Tx | Tx |
| lvs | Standard Latvian | Latn | Sp, Tx | Tx |
| mai | Maithili | Deva | Sp, Tx | Tx |
| mal | Malayalam | Mlym | Sp, Tx | Tx |
| mar | Marathi | Deva | Sp, Tx | Tx |
| mkd | Macedonian | Cyrl | Sp, Tx | Tx |
| mlt | Maltese | Latn | Sp, Tx | Sp, Tx |
| mni | Meitei | Beng | Sp, Tx | Tx |
| mya | Burmese | Mymr | Sp, Tx | Tx |
| nld | Dutch | Latn | Sp, Tx | Sp, Tx |
| nno | Norwegian Nynorsk | Latn | Sp, Tx | Tx |
| nob | Norwegian Bokmål | Latn | Sp, Tx | Tx |
| npi | Nepali | Deva | Sp, Tx | Tx |
| nya | Nyanja | Latn | Sp, Tx | Tx |
| oci | Occitan | Latn | Sp | -- |
| ory | Odia | Orya | Sp, Tx | Tx |
| pan | Punjabi | Guru | Sp, Tx | Tx |
| pbt | Southern Pashto | Arab | Sp, Tx | Tx |
| pes | Western Persian | Arab | Sp, Tx | Sp, Tx |
| pol | Polish | Latn | Sp, Tx | Sp, Tx |
| por | Portuguese | Latn | Sp, Tx | Sp, Tx |
| ron | Romanian | Latn | Sp, Tx | Sp, Tx |
| rus | Russian | Cyrl | Sp, Tx | Sp, Tx |
| slk | Slovak | Latn | Sp, Tx | Sp, Tx |
| slv | Slovenian | Latn | Sp, Tx | Tx |
| sna | Shona | Latn | Sp, Tx | Tx |
| snd | Sindhi | Arab | Sp, Tx | Tx |
| som | Somali | Latn | Sp, Tx | Tx |
| spa | Spanish | Latn | Sp, Tx | Sp, Tx |
| srp | Serbian | Cyrl | Sp, Tx | Tx |
| swe | Swedish | Latn | Sp, Tx | Sp, Tx |
| swh | Swahili | Latn | Sp, Tx | Sp, Tx |
| tam | Tamil | Taml | Sp, Tx | Tx |
| tel | Telugu | Telu | Sp, Tx | Sp, Tx |
| tgk | Tajik | Cyrl | Sp, Tx | Tx |
| tgl | Tagalog | Latn | Sp, Tx | Sp, Tx |
| tha | Thai | Thai | Sp, Tx | Sp, Tx |
| tur | Turkish | Latn | Sp, Tx | Sp, Tx |
| ukr | Ukrainian | Cyrl | Sp, Tx | Sp, Tx |
| urd | Urdu | Arab | Sp, Tx | Sp, Tx |
| uzn | Northern Uzbek | Latn | Sp, Tx | Sp, Tx |
| vie | Vietnamese | Latn | Sp, Tx | Sp, Tx |
| xho | Xhosa | Latn | Sp | -- |
| yor | Yoruba | Latn | Sp, Tx | Tx |
| yue | Cantonese | Hant | Sp, Tx | Tx |
| zlm | Colloquial Malay | Latn | Sp | -- |
| zsm | Standard Malay | Latn | Tx | Tx |
| zul | Zulu | Latn | Sp, Tx | Tx |

Speech Translation API
Endpoint:
POST /speech/translate
Description:
This endpoint translates an uploaded audio file from a source language to a target language. It supports speech-to-text translation tasks.
Request Headers:
| Header | Type | Required | Description |
|--------|------|----------|-------------|
| APIKEY | String | Yes | API key for authorization |

Request Body (Form-Data):
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| audio | File | Yes | The audio file to be translated. |
| source | String | Yes | The language of the audio (e.g., en for English). |
| target | String | Yes | The language to translate the audio to. |
| task | String | No | Translation task type. Default is S2TT (Speech-to-Text-to-Translation). |

Example Request (Form-Data):
| Key | Value |
|-----|-------|
| audio | (upload audio file) |
| source | en |
| target | fr |
| task | S2TT |
Note: Language codes follow the table above.
Task: use S2TT to translate to text only; use S2ST to translate with both audio and text output.
target: accepts multiple outputs, e.g., "vie,eng,spa".

Response:
| Field | Type | Description |
|-------|------|-------------|
| status | String | Status of the translation request. |
| output | String | The translated text or processed output. |
| error | String | Error message if applicable. |

Successful Response (200 OK):
{ "status": "success", "output": "Bonjour", "error": "" }
Error Response (500 Internal Server Error):
{ "status": "failure", "output": "", "error": "System encountered an unexpected error. <error message>" }
Error Handling:
- 401 Unauthorized: Invalid API key.
- 500 Internal Server Error: System encountered an unexpected error.
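A minimal Python sketch of this call, assuming the endpoint is POST {Domain}/speech/translate with the APIKEY header and the multipart form fields described above (the local file name is illustrative):

```python
import requests

BASE_URL = "https://agents.monkeyenglish.net"
HEADERS = {"APIKEY": "a813ec766197294184a938c331b08e7g"}

# Send the audio file plus the form fields from the table above.
with open("speech.wav", "rb") as f:
    resp = requests.post(
        f"{BASE_URL}/speech/translate",
        headers=HEADERS,
        files={"audio": ("speech.wav", f, "audio/wav")},
        data={"source": "en", "target": "fr", "task": "S2TT"},
    )

resp.raise_for_status()
print(resp.json())  # {"status": "...", "output": "...", "error": "..."}
```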
Audio Streaming Client for Speech-to-Text and Translation (S2TT)
1. Overview
This document provides an overview of how to implement a client for streaming audio data to a WebSocket server that processes the data for speech-to-text-to-translation (S2TT) tasks. The system is designed to handle real-time audio streaming from clients, which can be built using various programming languages.
Key Components:
- WebSocket Server: The server receives audio data from the client, processes it, and returns results (e.g., transcriptions, translations).
- Client: Any client application (mobile, desktop, web) can stream audio to the server over WebSocket.
- Streaming Protocol: Audio data is chunked and transmitted in real-time, with metadata indicating task details such as the source language, target language, and processing task.
2. Communication Flow
2.1 Initial Connection
- Client Connects to Server: The client establishes a WebSocket connection with the server at a predefined URI.
  - Example WebSocket URI:
    ws://<server-address>:<port>/ws/translate/<session_id>
    wss://agents.monkeyenglish.net/ws/translate/123
    session_id: a random string
- Task Metadata: The client sends an initial message to define the task. This message includes:
  - Source Language: The language of the audio input (e.g., eng for English).
  - Target Language: The language for translation (e.g., vie for Vietnamese).
  - Task Type: The processing task (e.g., S2TT for Speech-to-Text-to-Translation).

  Message Format (JSON):
  { "type": "start", "data": { "source": "eng", "target": "vie", "task": "S2TT" } }
2.2 Streaming Audio Data
- Audio Streaming: The client reads and sends audio data in chunks to the server. Each chunk is a segment of the full audio file, mimicking real-time audio streaming.
  - The audio data is converted into a byte stream for transmission.
- Transmission Format:
  - Audio chunks are transmitted in binary format (e.g., byte array).
  - Each chunk is sent over the WebSocket connection, followed by a short delay to simulate real-time audio capture.
- Streaming Example:
  - For every audio chunk, the client sends the binary data over the established WebSocket connection.
  - The client continues sending chunks until the entire audio file has been transmitted.
2.3 Task Completion
- End of Transmission: After the client finishes sending all audio chunks, it sends a final message to the server indicating that the streaming is complete and the task can be processed.

  Message Format (JSON):
  { "type": "do_task", "data": { "source": "eng", "target": "vie", "task": "S2TT" } }
- Processing Response: The server processes the received audio, performing the requested task (e.g., transcription and translation). Once complete, the server responds with the result, which may include:
  - Transcribed text.
  - Translated text.
- Response Format: The server sends a JSON message back to the client containing the task's result:
{
"message": "",
"data": {
"vie": {
"text": "thế là sáng hôm sau cái tin tôi về đến cổng còn phải thăm đường đã lan ra khóc sóng",
"audio": "https://vnmedia2.monkeyuni.net/App/uploads/productivity/8tihbdQvbQPHkcqmDntW.wav"
},
"eng": {
"text": "So the next morning, when I got back to the cage, I had to walk down the street to cry.",
"audio": "https://vnmedia2.monkeyuni.net/App/uploads/productivity/5Um6tQ1nzT3BfEqOYzx4.wav"
}
},
"status": "success"
}

3. Client Implementation Guidelines
3.1 Supported Languages
The client can be developed in any language that supports WebSocket communication, such as:
- JavaScript: Web-based applications.
- Python: Server-side or command-line tools.
- Java/Kotlin: Android applications.
- Swift: iOS applications.
- C#: Desktop or .NET applications.
3.2 WebSocket Library
Ensure that the client uses a WebSocket library suitable for your chosen programming language. Common libraries include:
- JavaScript: Native WebSocket API or popular libraries like socket.io.
- Python: websockets or websocket-client.
- Java/Kotlin: OkHttp WebSocket implementation.
- Swift: Starscream library for WebSocket communication.
3.3 Audio File Handling
The client needs to handle reading audio files or capturing audio in real time. The audio format must be compatible with the server's requirements (e.g., 16 kHz, mono, .wav).

3.4 Chunking and Streaming
The client should send audio data in small chunks. For real-time applications:
- Chunk Size: Each chunk should be small enough to allow near real-time transmission, typically between 1-3 seconds of audio data per chunk.
- Delay: Introduce a small delay (e.g., 1-10 milliseconds) between sending each chunk to simulate real-time streaming.
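To make the whole flow concrete, here is a minimal Python sketch using the websockets library (listed in section 3.2). The message formats follow sections 2.1-2.3; the session_id, file name, chunk size, and delay are illustrative assumptions:

```python
import asyncio
import json
import websockets

URI = "wss://agents.monkeyenglish.net/ws/translate/123"  # session_id "123" is illustrative

async def stream_audio(file_path: str):
    async with websockets.connect(URI) as ws:
        # 2.1 Task metadata: declare languages and task type first.
        await ws.send(json.dumps(
            {"type": "start", "data": {"source": "eng", "target": "vie", "task": "S2TT"}}
        ))

        # 2.2 Stream the audio file in binary chunks with a short delay
        # (chunk size and delay per the guidance in section 3.4).
        with open(file_path, "rb") as f:
            while chunk := f.read(32000):
                await ws.send(chunk)
                await asyncio.sleep(0.01)

        # 2.3 Signal end of transmission so the server runs the task.
        await ws.send(json.dumps(
            {"type": "do_task", "data": {"source": "eng", "target": "vie", "task": "S2TT"}}
        ))

        # Wait for the JSON result (transcription and translation).
        print(json.loads(await ws.recv()))

asyncio.run(stream_audio("sample_16k_mono.wav"))
```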
3.5 Error Handling
The client must handle potential errors during the WebSocket communication:
- Connection Issues: Reconnect if the WebSocket connection is dropped.
- Server Responses: Handle unexpected responses or errors from the server gracefully.
- Timeouts: Implement timeouts to prevent hanging connections if no response is received from the server.
4. Server Configuration
4.1 Server URI
Clients must connect to the WebSocket server at the following URI:
ws://<server-address>:<port>/ws/translate/<session_id>
- <server-address>: IP address or domain of the WebSocket server.
- <port>: Port on which the server is running (e.g., 5001).
- <session_id>: A unique identifier for the client session, generated for each streaming session.
4.2 Audio Processing
The server is responsible for:
- Receiving and buffering audio chunks.
- Processing the audio (speech recognition, translation).
- Sending results back to the client in the expected format.
5. Example Use Cases
5.1 Mobile Voice Translation App
A mobile app developed in Java or Swift captures the user's voice, streams the audio to the server using WebSocket, and receives the translated text, which is displayed to the user in real-time.
5.2 Web-Based Audio Translator
A JavaScript web application allows users to upload audio files. The app streams the audio to the server, processes it, and shows the translation results to the user.
5.3 Desktop Speech-to-Text Tool
A Python desktop application records audio from the microphone, streams it to the server, and displays real-time transcription and translation.
6. Conclusion
This document provides an overview of the WebSocket-based client-server system for real-time audio streaming and processing. The client can be implemented in any language with WebSocket support, allowing flexible integration across various platforms and applications.
-
trunghoang12
Recommend
Research and solution documentation for the Recommend System (file: "Tài liệu về nghiên cứu và giải pháp Recomemd System .docx").
-
trunghoang12
AI Converter
API Tools
- The API Tool provides APIs for the following features: Sync Text, Normalize Audio MP3.
- Sync Text for Audio API Documentation
Endpoint
URL:
https://aitools.monkeyenglish.net/segement
Method:
POST
Parameters
| Name | Type | Description | Required | Example |
|------|------|-------------|----------|---------|
| audio | File (mp3) | Uploaded audio file in MP3 format | Yes | audio=@L2U6 - Chant 4 - Mr. Billy.mp3 |
| karaoke_format | Boolean | Whether to convert to karaoke format | Yes | true or false |
| text | String | Text used for mapping sync text (optional) | No | "Isn't this regular milk?" |

Returns

| Name | Type | Description |
|------|------|-------------|
| data | Object | Output of the sync text function |
| message | String | Information about sync text success or mapping state |
| status | Boolean | True if the system completed all tasks, else False |
| exception | String | Empty if the task completed, else contains the exception message |

Example Request
```bash
curl -X 'POST' \
  'https://aitools.monkeyenglish.net/segement' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'audio=@L2U6 - Chant 4 - Mr. Billy.mp3;type=audio/mpeg' \
  -F 'karaoke_format=false' \
  -F 'text='
```
- API to normalize an MP3 file
Endpoint: https://aitools.monkeyenglish.net/normalize-audio
Method: POST
Parameters
| Name | Type | Description | Required | Example |
|------|------|-------------|----------|---------|
| file | File (mp3) | Uploaded audio file in MP3 format | Yes | file=@L2U6 - Chant 4 - Mr. Billy.mp3 |

Returns

| Name | Type | Description |
|------|------|-------------|
| data | Object | Output of the normalized audio file |
| message | String | Information about normalization success or failure |
| status | Boolean | True if the system completed all tasks, else False |
| exception | String | Empty if the task completed, else contains the exception message |

Example Request
```bash
curl -X 'POST' \
  'https://aitools.monkeyenglish.net/normalize-audio' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'file=@L2U6 - Chant 4 - Mr. Billy.mp3;type=audio/mpeg'
```
Score Audio from CSV
POST /score_audio
Description: Processes an uploaded CSV or Excel file and returns a CSV file with processed results.

Request
Content-Type: multipart/form-data
Parameters:
file (required): An uploaded file in CSV or Excel format. The file should contain the data to be processed.
name_column_text (required): The name of the column in the file that contains text data.
name_column_audio (required): The name of the column in the file that contains audio data.
name_column_output (required): The name of the column where the processed results will be written.
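A minimal Python sketch of this upload, assuming the endpoint lives on the same host as the other AI Tools APIs; the file and column names are illustrative:

```python
import requests

URL = "https://aitools.monkeyenglish.net/score_audio"  # assumed base URL

# Upload the CSV and name the relevant columns; the response body
# is the processed CSV file.
with open("input.csv", "rb") as f:
    resp = requests.post(
        URL,
        files={"file": ("input.csv", f, "text/csv")},
        data={
            "name_column_text": "transcript",
            "name_column_audio": "audio_url",
            "name_column_output": "score",
        },
    )

resp.raise_for_status()
with open("scored.csv", "wb") as out:
    out.write(resp.content)
```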
Convert Audio API

Base URL
The base URL for the API is:
https://aitools.monkeyenglish.net/
Endpoints
1. Convert MP3 to WAV
Endpoint:
POST /mp3-to-wav/
Description: Converts an MP3 file to a WAV file.
Request:
- file: The MP3 file to be converted. (required)
Response:
- Returns a WAV file with the Content-Disposition header set to attachment; filename={original_filename}.wav.
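A minimal Python sketch of this conversion (the local file names are illustrative):

```python
import requests

BASE_URL = "https://aitools.monkeyenglish.net"

# Upload an MP3 and save the returned WAV attachment to disk.
with open("song.mp3", "rb") as f:
    resp = requests.post(
        f"{BASE_URL}/mp3-to-wav/",
        files={"file": ("song.mp3", f, "audio/mpeg")},
    )

resp.raise_for_status()
with open("song.wav", "wb") as out:
    out.write(resp.content)
```

The other endpoints below follow the same pattern; only the path and the optional form fields (e.g., sample_rate, channels) change.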
2. Convert WAV to MP3
Endpoint:
POST /wav-to-mp3/
Description: Converts a WAV file to an MP3 file.
Request:
- file: The WAV file to be converted. (required)
Response:
- Returns an MP3 file with the Content-Disposition header set to attachment; filename={original_filename}.mp3.
3. Normalize MP3
Endpoint:
POST /normalize-mp3/
Description: Normalizes the volume of an MP3 file.
Request:
- file: The MP3 file to be normalized. (required)
Response:
- Returns a normalized MP3 file with the Content-Disposition header set to attachment; filename={original_filename}_normalized.mp3.
4. Normalize WAV
Endpoint:
POST /normalize-wav/
Description: Normalizes the volume of a WAV file and optionally changes the sample rate and number of channels.
Request:
- file: The WAV file to be normalized. (required)
- sample_rate: The sample rate for the output file. (optional, default: 44100)
- channels: The number of channels for the output file. (optional, default: 2)
Response:
- Returns a normalized WAV file with the Content-Disposition header set to attachment; filename={original_filename}_normalized.wav.
5. Convert WAV to Normalized MP3
Endpoint:
POST /wav-normalize-mp3/
Description: Converts a WAV file to an MP3 file and normalizes the volume.
Request:
- file: The WAV file to be converted and normalized. (required)
Response:
- Returns a normalized MP3 file with the Content-Disposition header set to attachment; filename={original_filename}_normalized.mp3.
6. Convert MP3 to Custom WAV
Endpoint:
POST /mp3-custom-wav/
Description: Converts an MP3 file to a WAV file with custom sample rate and number of channels.
Request:
- file: The MP3 file to be converted. (required)
- sample_rate: The sample rate for the output file. (optional, default: 44100)
- channels: The number of channels for the output file. (optional, default: 2)
Response:
- Returns a WAV file with the Content-Disposition header set to attachment; filename={original_filename}_normalized.wav.
CLI Tools
- CLI Usage Document for _sync_lip_sync.py
Overview
The _sync_lip_sync.py script synchronizes lip sync data from a source directory to a destination directory. It is executed with Python 3 and takes two arguments: the source directory and the destination directory.

Prerequisites
- Ensure that Python 3 is installed on your system. You can download it from python.org.
- Make sure the _sync_lip_sync.py script is in your working directory, or provide the full path to the script.
- Ensure you have the necessary permissions to read from the source directory and write to the destination directory.
Usage
To run the _sync_lip_sync.py script, open your terminal or command prompt and execute the following command:

```bash
python3 _sync_lip_sync.py <source_directory> <destination_directory>
```
- CLI Tool for Scoring Audio
Overview
This CLI tool is designed to score audio files based on transcripts provided in a CSV file. It downloads audio files from URLs specified in the CSV, processes them, and outputs the results to a specified CSV file.
Usage
To use the CLI tool, run the following command in your terminal:
```bash
python3 _scoring_audio_competition.py arg1 arg2 arg3 arg4
```

Arguments
- arg1: Input filename (CSV). The path to the input CSV file containing the transcripts and URLs.
- arg2: Column text. The name of the column in the CSV file that contains the transcripts.
- arg3: Column URL. The name of the column in the CSV file that contains the URLs of the audio files to be downloaded.
- arg4: Output filename (CSV). The path to the output CSV file where the scoring results will be saved.
Detailed Instructions
- Prepare Input CSV File
  - Create a CSV file with at least two columns: one for the transcripts and one for the URLs of the audio files.
  - Ensure the column names match the values you will provide for arg2 and arg3.
Example:
```csv
transcript,url
"This is a sample transcript","http://example.com/audio1.mp3"
"Another transcript","http://example.com/audio2.mp3"
```

- Run the Tool
  - Open your terminal.
  - Navigate to the directory where _scoring_audio_competition.py is located.
  - Run the following command, replacing input.csv, transcript, url, and output.csv with your actual file and column names:
```bash
python3 _scoring_audio_competition.py input.csv transcript url output.csv
```

- Output
- The tool will download the audio files from the URLs specified in the input CSV.
- It will process the audio files and score them based on the provided transcripts.
- The results will be saved in the output CSV file specified.
Example Command
Here is an example command based on the provided instructions:
```bash
python3 _scoring_audio_competition.py example_input.csv transcript url example_output.csv
```

In this example:
- example_input.csv is the input CSV file containing the transcripts and audio URLs.
- transcript is the name of the column containing the transcripts.
- url is the name of the column containing the audio URLs.
- example_output.csv is the output CSV file where the scoring results will be saved.
Notes
- Ensure you have all necessary dependencies installed before running the script.
- The input CSV file should be well-formatted to avoid any errors during processing.
- The output CSV file will contain the results of the scoring process, which can be used for further analysis.
-
trunghoang12
- Install KIC on K8S
Kong Ingress Controller allows you to run Kong Gateway as a Kubernetes Ingress to handle inbound requests for a Kubernetes cluster.
Installing KIC on K8s: [link](https://docs.konghq.com/kubernetes-ingress-controller/latest/get-started/)
- Add a Service into KONG
-
trunghoang12
Technical and integration documentation for AI M-Speak Dialogue
- Overview diagram of the system components
- Core Dialogue: Handles the dialogue business logic and communicates with the other components in the system. It is the interface the app connects to via gRPC.
- ELS: Logging and monitoring of metrics during processing.
- Triton Serving: A service that runs inference for the AI models used while a dialogue takes place.
- OpenAI: A third party used to generate responses for users.
- Postgres: The main database of Dialogue.
- Redis: Caching.
- Main modules in the Core Dialogue BE
The Core Dialogue BE consists of five main modules:
2.1 Manager
- Contains the classes that interact with the database, performing select, update, insert, and delete operations for each table.
-> All of them are used through factory_cls.py (Factory design pattern).
The tables corresponding to these classes follow the database design below.
2.2 Modeling
- Contains classes that support AI model inference and extend triton_client; used through factory_ai.py.
- Predefined functions for use with OpenAI.
2.3 Pipeline:
There are three main dialogue pipelines:
- General pipeline: Every user answer passes through this pipeline.
- Response pipeline: Depending on the case, the user is routed into one of three pipelines according to the type of question being answered: Yes/No question, Media question, or Opening question.
- Answer pipeline: Generates the response to the user along with the next action the user needs to take.
Used through factory_pipeline.py.
2.4 Resource
- A collection of files including config files and YAML files that store default logic and sample answers.
- The service's proto file.
2.5 Grpc_libs:
- Defines all input/output logic of the service when connecting with the core.
- Database design
ER Diagram Description
tbl_conversation
| Column Name | Description |
|-------------|-------------|
| id | Primary key. |
| conversation_name | Name of the conversation. |
| description | Textual description of the conversation. |
| level | Level of the conversation. |
| voice | The type of voice used. |
| type | Conversation type (format/style). |
| num_tries | Number of tries allowed. |
| greeting | Greeting message/phrase. |
| greeting_media_url | URL to media file for the greeting. |
| ending | Ending message/phrase. |
| ending_media_url | URL to media file for the ending. |
| bundle_path | Path to the conversation's bundled files. |
| zip_path | Path to the zip file containing conversation data. |
| created_at | Record creation timestamp. |
| updated_at | Last record update timestamp. |
tbl_question
| Column Name | Description |
|-------------|-------------|
| id | Primary key. |
| conversation_id | Foreign key (links to tbl_conversation). |
| question | The question text. |
| media_url | URL to media file for the question. |
| index | The order/index of the question in the conversation. |
| attribute_extend | Additional attributes/metadata. |
| created_at | Record creation timestamp. |
| updated_at | Last record update timestamp. |
| intent_condition | Conditions for intent applicability. |
tbl_intent
| Column Name | Description |
|-------------|-------------|
| id | Primary key. |
| question_id | Foreign key (links to tbl_question). |
| intent | The intent associated with the question. |
| response | Response text for the intent. |
| media_url | URL to media file for the response. |
| retrial | Number of retry attempts allowed. |
| created_at | Record creation timestamp. |
| updated_at | Last record update timestamp. |
Relationships:
- tbl_conversation: Has a one-to-many relationship with tbl_question via conversation_id.
- tbl_question: Has a one-to-many relationship with tbl_intent via question_id.
- The main functions in the service include:
UC1: Create a new conversation
UC2: Create a new conversation from a file (updating...)
UC3: Get the list of conversations
UC4: Control the conversation (Answer Communication Talk)

To create a new conversation, you need the conversation name, which corresponds to the topic name.
Each conversation consists of a set of different questions and answers.
Each question belongs to exactly one of three types: Yes/No question, Media question, or Opening question.

Dialogue integration flow
Postman document: Link
Step 0: When the user selects a topic, the introduction video plays.
Step 1: When the intro video ends, the app sends a start command request.
Step 2: After start, the system registers a new conversation and returns some information:
- session: information about the user's working session; the client mainly needs session_id.
- question: information about the user's current question.
- reply: the response to the user's answer. Since start is the first turn, the user has not answered anything yet, so this is simply the opening question.
Step 3: The user answers the question.
Step 4: Repeat until all questions are done.
If the session data contains "is_last": true, that was the last question and no more questions follow; the ending video is then played.

WebSocket Client Usage Guide for Speech-Text
This document explains how to use a WebSocket client to send JSON data followed by audio data to a WebSocket server.
Prerequisites
Ensure you have the following prerequisites:
- A WebSocket server running and accessible.
- An audio file in WAV format (sample_rate = 16000, channels = 1)
Steps to Use the WebSocket Client
- Prepare the Audio File:
  - Ensure you have an audio file in WAV format that you want to send to the WebSocket server (sample rate 16000, channels = 1).
- Connect to the WebSocket Server:
  - Establish a connection to the WebSocket server using the server's URI (Uniform Resource Identifier). The URI typically includes the protocol (ws or wss), the server's address, and the endpoint for the WebSocket connection. For example:
    ws://localhost:8000/ws/{device_id}
  - Domain dev: wss://lipsync.monkeyenglish.net/ws/{device_id}
  - Domain live: wss://videocall.monkeyenglish.net/ws/{device_id}
  - Note: device_id is a unique string; you can use the profile ID or another identifier.
- Send JSON Data:
  - Send a JSON message containing the context of the speech. The JSON data should have a key named context whose value is text describing the context. For example:
    { "context": "This is a test context for speech-to-text conversion." }
  - Ensure the JSON message is sent first, before sending the audio data.
  - The context can be the question, the topic type, or anything that narrows the scope of the audio input; preferably the question.
- Send Audio Data:
  - Read the audio data from the WAV file and send it as binary data to the WebSocket server. Make sure the audio data is sent after the JSON message.
- Receive the Response:
  - Wait for the WebSocket server to process the data and send a response. The response could be the result of the speech-to-text conversion or any other relevant information. Example response:
    {"text": "They are very beautiful"}
- Handle Disconnection:
  - Be prepared to handle disconnections from the WebSocket server gracefully. Ensure that any resources or connections are properly closed.
Example Workflow
- Connect to the WebSocket server at ws://localhost:8000/ws/{device_id}.
- Send JSON data with the key context and a relevant value.
- Send audio data from a WAV file.
- Receive and process the server's response.
- Close the connection gracefully.
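A minimal Python sketch of this workflow using the websockets library; the device_id and file name are illustrative assumptions:

```python
import asyncio
import json
import websockets

URI = "wss://lipsync.monkeyenglish.net/ws/demo-device-001"  # device_id is illustrative

async def speech_to_text(wav_path: str):
    async with websockets.connect(URI) as ws:
        # 1. Send the JSON context message first.
        await ws.send(json.dumps({"context": "What do you think of the flowers?"}))

        # 2. Then send the WAV audio (16 kHz, mono) as binary data.
        with open(wav_path, "rb") as f:
            await ws.send(f.read())

        # 3. Receive the result, e.g. {"text": "They are very beautiful"}.
        print(json.loads(await ws.recv()))

asyncio.run(speech_to_text("sample_16k_mono.wav"))
```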
Notes
- Ensure that the audio data is in the correct format (e.g., float32) before sending it.
- If an error occurs during the process, handle it appropriately and attempt to reconnect if necessary.
- The WebSocket server should be configured to handle both JSON and binary data correctly.
API Docs: detailed documentation of the APIs in this service
-
trunghoang12
This document provides instructions for integrating Monkey's handwriting model into edge devices.
Tech Stack
- TensorFlow Lite
- Unity
Introduction
Handwriting is an AI model converted to TensorFlow Lite format, which Google supports across multiple platforms.
The model can be downloaded here.

1. Hyperparameters
Some configurations of the model:
| Name | Description | Value |
|------|-------------|-------|
| Input shape | The input is an image resized to width x height x channel, 128 x 128 x 3 (BGR) | 1 x 128 x 128 x 3 |
| Output | A score array over the classes; the prediction label is the index of the maximum value in the output array | 1 x 26 |

The mapping from index to predicted character is shown below.
{"0": "a", "1": "b", "2": "c", "3": "d", "4": "e", "5": "f", "6": "g", "7": "h", "8": "i", "9": "j", "10": "k", "11": "l", "12": "m", "13": "n", "14": "o", "15": "p", "16": "q", "17": "r", "18": "s", "19": "t", "20": "u", "21": "v", "22": "w", "23": "x", "24": "y", "25": "z"}
Example: The model predicts for an image with output:
[[7.8083836e-03 1.0330592e-02 3.6540066e-04 1.1240702e-01 1.2986563e-01 8.1596321e-05 2.7041843e-03 1.8760953e-02 1.5376755e-03 8.4590465e-05 1.9241240e-02 2.4502007e-02 2.1457224e-01 1.2494331e-02 1.9096583e-02 2.9417273e-04 2.1153286e-02 1.8904490e-02 6.2950579e-03 3.8062898e-03 1.2752166e-01 2.5853007e-03 1.6490310e-01 3.3960843e-03 2.3415815e-03 7.4946553e-02]]
We can see that the maximum value is at index 12 (indices start from 0), with a confidence score of about 21%. Mapping it to a label gives the character 'm'.
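A minimal Python sketch of running the model and decoding the output with TensorFlow Lite; the model file name is an illustrative assumption, and the image is assumed to be preprocessed as described in the next section:

```python
import numpy as np
import tensorflow as tf

LABELS = "abcdefghijklmnopqrstuvwxyz"  # the index-to-character mapping above

# Load the TFLite model (file name is an assumption).
interpreter = tf.lite.Interpreter(model_path="handwriting.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# `image` must already be preprocessed to 1 x 128 x 128 x 3, float32 (BGR).
image = np.zeros((1, 128, 128, 3), dtype=np.float32)
interpreter.set_tensor(inp["index"], image)
interpreter.invoke()

scores = interpreter.get_tensor(out["index"])[0]  # shape (26,)
pred = int(np.argmax(scores))
print(LABELS[pred], float(scores[pred]))
```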
2. Preprocessing for the model
The model requires an image with a shape of 128 x 128 x 3.
For the model to recognize characters accurately, the image should be prepared according to the following requirements:
- Text: black
- Background: white
- Object: centered in the image
- Object size: 50-60% of the image
- Resize the image to 128 x 128 x 3
- Value type: fp32
"""
def pre_process(self, img):# Cropping mask = img != 255 mask = mask.any(2) mask0,mask1 = mask.any(0),mask.any(1) colstart, colend = mask0.argmax(), len(mask0)-mask0[::-1].argmax()+1 rowstart, rowend = mask1.argmax(), len(mask1)-mask1[::-1].argmax()+1 img = img[rowstart:rowend, colstart:colend] img_h, img_w = img.shape[0], img.shape[1] # Padding # Create white image img_size = img_w if img_w > img_h else img_h img_size = int(1.9 * img_size) new_img = np.zeros([img_size , img_size , 3], dtype=np.uint8) new_img.fill(255) # insert text image into white image start_w = int((new_img.shape[1] - img_w ) /2) start_h = int((new_img.shape[0] - img_h ) /2) new_img[start_h : start_h + img_h,start_w : start_w + img_w ,:] = img[:,:,:] # thicken text. iterations = int(img_size / 128) if iterations > 1 : kernel = np.ones((5, 5), np.uint8) new_img = cv2.erode(new_img, kernel, iterations=iterations) # resize image new_img = cv2.resize(new_img, (self.size, self.size), interpolation = cv2.INTER_AREA) new_img = np.array(new_img, dtype=np.float32) return new_img
"""
Example input:
3. Installation
4. Usage API from Server
Live: https://app.monkeyenglish.net/mspeak/handwriting
Dev: https://ai.monkeyenglish.net/handwriting

| Field | Description |
|-------|-------------|
| Method | POST |
| Header | 'APIKEY': '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' |
| Body | {"image": string base64 of an image, "pre_process": true, "target": ""} |
| Response | {"status": true, "text": [{"character": "z", "conf_score": 100.0}, {"character": "o", "conf_score": 100.0}, {"character": "c", "conf_score": 100.0}], "msg": ""} |
- Note:
- pre_process: the system will pre-process the image to improve model performance.
- target: text to compare against the prediction (nullable); any value can be sent.
- In the response: text is the prediction for the image; conf_score is a confidence score (0-100%).
-
trunghoang12
Triton Serving for AI Model
This repo, on the branch "triton_serving", provides sources and guides for taking AI models to production through TensorRT and Triton Serving.
Install requirements
Triton supports some platforms to bring your AI model to production with high performance, such as Torch, TensorFlow, TensorRT, Onnx, and Pure Python.
In some cases, if you want to run your code with pure Python, some third-party libraries are required. You should create a custom base image.
Creating a base image
I attached a Dockerfile to build a custom base image with requirements.txt.
To build a base image, please insert your libraries into requirements.txt. Don't forget to define the version.
docker build . -t <image_name>:<image_tag>
Note: You can change the image name and image tag in <image_name>:<image_tag>.
Converting model
You can use any framework to develop your model, such as TensorFlow, Python, etc. But a pure framework is quite slow in production, so I strongly recommend converting to another format, such as ONNX or TensorRT.
While converting, you can enable fp16 or int8 mode to speed up inference time, but remember to re-check the model's precision after conversion.

Two common cases:
- ONNX: fp16 or fp32
- TensorRT: fp16, fp32 or int8
1. Installation
If you want to install the TensorRT environment on your local machine, you can follow the official instructions and documents.
You may run into issues when installing locally; double-check the versions in that case.
Docker is an easier and faster way to set up, and it avoids most installation problems:

docker run -d --gpus all -it --rm -v ./:/workspace nvcr.io/nvidia/tensorrt:23.09-py3

Map your workspace on the host machine into the Docker workspace with the -v argument.
The NVIDIA team exposes a conversion stage from deep learning frameworks to inference frameworks.
2. Converting to ONNX
The model should be converted to ONNX first, before being converted to TensorRT. You can follow the two instructions below to convert your model.
The Monkey's ONNX model was saved at S3: data-team-media-ai/model-zoo/onnx_model_zoo/
3. Converting to TensorRT
I used Docker to convert my model to TensorRT; you can refer to my command below:
trtexec --onnx=models/openai/whisper-small/onnx/whisper-small-encoder.onnx --saveEngine='model.engine' --explicitBatch --workspace=1024
- --onnx: path of the ONNX model
- --saveEngine: output path of the TensorRT model
- --explicitBatch: this option enforces a fixed (explicit) batch size
- --workspace: the maximum memory (in MB) available to each layer of the model
If you want to run fp16 or int8, add the corresponding flag to the command:
trtexec --onnx=onnx/question_statement/torch-model.onnx --saveEngine='model.engine' --explicitBatch --workspace=1024 --fp16
If you want to set a dynamic axis for the TensorRT model:
trtexec --onnx=onnx/sentence_transformer/all-mpnet-base-v2.onnx --saveEngine='model.engine' --minShapes=input_ids:1x1,attention_mask:1x1 --optShapes=input_ids:1x15,attention_mask:1x15 --maxShapes=input_ids:1x384,attention_mask:1x384
You can use any file name for the export, but to make clear whether a file is a TensorRT model, set the extension to one of [.plan, .trt, .engine]. Note that Triton only recognizes .plan files.
Serving Triton
After converting the model to TensorRT format, we can bring it to production through Triton Serving.
Some steps to apply them to products:
- Create a model_repository
This folder will hold all of your models.

```
model_repository
|
+-- handwriting
    |
    +-- config.pbtxt
    +-- 1
        |
        +-- model.onnx
```
- Define the model config inside config.pbtxt:
name: "handwriting" platform: "onnxruntime_onnx" max_batch_size : 32 input [ { name: "input_1" data_type: TYPE_FP32 format: FORMAT_NHWC dims: [ 128, 128, 3 ] reshape { shape: [128, 128, 3 ] } } ] output [ { name: "dense_1" data_type: TYPE_FP32 dims: [ 26 ] reshape { shape: [ 26] } label_filename: "labels.txt" } ]
- name: the model name; it must match the folder name.
- platform: the runtime for your model [onnxruntime_onnx, tensorrt_plan, torch, ...]
- max_batch_size: the maximum batch size of the model
- input: defines the input of the API
- output: defines the structure of the response

A model_repository can contain many sub-folders; each sub-folder corresponds to one model.
After converting the model, don't forget to upload it to S3:
aws s3 sync model_repository s3://data-team-media-ai/model-zoo/triton/
- Serving
docker run -d --gpus all --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p8000:8000 -p8001:8001 -p8002:8002 -v/home/mautrung/edu_ai_triton_serving/model_repository:/models nvcr.io/nvidia/tritonserver:23.09-py3-custom tritonserver --model-repository=/models
Note: device requirements for running Triton:
- An NVIDIA GPU with drivers installed.
- Docker and Docker Compose.
- [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)
Testing
Triton provides both protocols: GRPC (8001) and HTTP (8000).
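As an illustration, here is a minimal Python sketch querying the handwriting model over HTTP with the tritonclient package (input/output names follow the config.pbtxt above; the server address is an assumption):

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# One preprocessed image, matching the dims in config.pbtxt.
image = np.zeros((1, 128, 128, 3), dtype=np.float32)

inp = httpclient.InferInput("input_1", list(image.shape), "FP32")
inp.set_data_from_numpy(image)
out = httpclient.InferRequestedOutput("dense_1")

result = client.infer(model_name="handwriting", inputs=[inp], outputs=[out])
scores = result.as_numpy("dense_1")  # shape (1, 26)
print(int(np.argmax(scores[0])))
```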
Benchmark API from Triton
We can benchmark a model served by Triton with the Apache Benchmark (ab) tool.
ab -p data_samples/body_bm_wav2vec/bm_w2v.txt -T application/json -c 100 -n 1000 http://localhost:8000/v2/models/wav2vec_trt/infer
- data_samples/body_bm_wav2vec/bm_w2v.txt: the file containing the request body, JSON-formatted but saved as a .txt file.
- -c: number of concurrent requests.
- -n: total number of requests.
Some sample data in a folder: data_samples
Result:
-
trunghoang12
Documentation for the AI web testing back end
The BE project provides the APIs that handle the logic flows for the AI web at the link:
The BE's responsibilities include:
- Storing and updating participant information and participant states
- Proposing questions and handling exam logic for each level
- Connecting to the Mspeak Service for scoring
- Connecting to Kinesis to sync activity data with the Report Service flow
Kinesis -> Lambda -> DynamoDB
Some configuration details for the Report Service:
- Kinesis: dev: ai_report, live: ai_report_production
- Lambda Function: dev: build_ai_testing_report, live: ai_webtest_production
- DynamoDB Table: dev: ai_testing_report, live: ai_testing_report_production
- Database name: dev: edu_platform_6, live: edu_platform
The project's technical documents are tracked below: