  • trunghoang12 trunghoang12

    Stream ASR (Speech to Text Online)

    Stream ASR is a module for real-time speech-to-text built on the Riva SDK from Nvidia.

    It uses the Conformer model for recognition and exposes a Socket.IO interface to keep client integration simple.

    Work Flow

    • Step 1: Connect to the host https://agents.monkeyenglish.net (Socket.IO).

    • Step 2: To start a speech-to-text session, emit the event "on_start".
      If on_start succeeds, the server responds with the message

    {'status': 'Connected to server successful.'}
    
    • Step 3: Once on_start has succeeded, send the audio data to the event "audio_stream".

    • Step 4: Results for audio_stream arrive on the event "asr_response".

    • Step 5: To finish the session, emit the event "on_end". When the session has ended, the server returns the message

    {'status': 'Stopped to server successful.'}
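
The five steps above can be sketched as one small helper. This is only an illustration: `run_asr_session` and `RecordingClient` are hypothetical names, the helper accepts any object with an `emit(event, data)` method (such as a `socketio.Client`), and for brevity it does not wait for the server's status message before sending audio, which a real client probably should.

```python
from typing import Any, Iterable


def run_asr_session(client: Any, chunks: Iterable[Any]) -> int:
    """Emit on_start, each chunk on audio_stream, then on_end.

    Returns the number of chunks sent. on_end is emitted even if sending
    fails, so the server-side session is not left open.
    """
    sent = 0
    client.emit("on_start")
    try:
        for chunk in chunks:
            client.emit("audio_stream", chunk)
            sent += 1
    finally:
        client.emit("on_end")
    return sent


class RecordingClient:
    """Stand-in client that records emitted events instead of using the network."""

    def __init__(self) -> None:
        self.events = []

    def emit(self, event: str, data: Any = None) -> None:
        self.events.append((event, data))
```

Passing a `RecordingClient` makes it possible to check the event order offline before pointing the same helper at a real connection.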
    

    Code example

    • For JS
    const socket = io('https://agents.monkeyenglish.net');
    
    // Handle connection
    socket.on('connect', () => {
        console.log('Connected to server');
    });
    
    // Handle disconnection
    socket.on('disconnect', () => {
        console.log('Disconnected from server');
    });
    
    // Handle ASR response
    socket.on('asr_response', (data) => {
        console.log('Received ASR response:', data);
    });
    
    // Function to send audio data
    function pushAudioStream(audioData) {
        socket.emit('audio_stream', audioData);
    }
    

    // Example of reading log file and sending data (implement as needed)

    • For Python
    import socketio
    import threading
    import time
    
    # Create a Socket.IO client
    sio = socketio.Client()
    
    # Event handler for connection
    @sio.event
    def connect():
        print('Connected to server')
    
    # Event handler for disconnection
    @sio.event
    def disconnect():
        print('Disconnected from server')
    
    # Event handler for 'asr_response' event
    @sio.on('asr_response')
    def on_asr_response(data):
        print('Received ASR response:', data)
    
    # Function to push data to 'audio_stream'
    def push_audio_stream(audio_data: str):
        sio.emit('audio_stream', audio_data)
        # print(f'Pushed data to audio_stream: {audio_data}')
    
    # Function to read and push lines from the log file
    def stream_log_file(file_path: str):
        lines = []
        with open(file_path, 'r') as file:
            for line in file:
                # Each line is expected to hold the payload after a two-space separator;
                # lines without it are skipped. Adjust based on your log format.
                parts = line.strip().split("  ")
                if len(parts) > 1:
                    lines.append(parts[1])
        
        for line in lines:
            push_audio_stream(line)
            time.sleep(0.1)  # Delay between sending lines
    
        # Close the ASR session once all data has been sent (Step 5)
        sio.emit('on_end')
    
    # Function to handle the streaming and listening concurrently
    def start_streaming_and_listening():
        # Connect to the Socket.IO server and open an ASR session (Step 2)
        sio.connect('https://agents.monkeyenglish.net')
        sio.emit('on_start')
    
        # Start a separate thread to stream the log file
        log_file_path = 'com.earlystart.monkeytalk-latest.log'
        stream_thread = threading.Thread(target=stream_log_file, args=(log_file_path,))
        stream_thread.start()
    
        # Keep the main thread alive to listen for responses
        sio.wait()
    
    # Start the process
    if __name__ == "__main__":
        start_streaming_and_listening()
    

    posted in Data_team read more
  • trunghoang12 trunghoang12

    Remove Background API Documentation

    This API endpoint allows users to remove the background from an image.

    HTTP Method:

    POST

    Endpoint

    https://agents.monkeyenglish.net/api/v1/images/remove_bg

    Headers:

    • accept: application/json
      Specifies the expected response format (JSON).
    • APIKEY: a813ec766197294184a938c331b08e7g
      A unique API key used for authentication.
    • Content-Type: multipart/form-data
      Required when uploading files.

    Parameters:

    • image (required):
      The image file to be processed. You need to specify the image file using @filename in curl. Make sure to set the correct MIME type for the image.

    Example curl Request:

    curl -X 'POST' \
      'https://agents.monkeyenglish.net/api/v1/images/remove_bg' \
      -H 'accept: application/json' \
      -H 'APIKEY: a813ec766197294184a938c331b08e7g' \
      -H 'Content-Type: multipart/form-data' \
      -F 'image=@images.jpeg;type=image/jpeg'
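
For clients that cannot shell out to curl, the same request can be assembled with the Python standard library. The sketch below only builds the request object and does not send it; `build_remove_bg_request` is a hypothetical helper name, and the multipart body is encoded by hand with a single `image` part, mirroring the headers listed above.

```python
import urllib.request
import uuid

REMOVE_BG_URL = "https://agents.monkeyenglish.net/api/v1/images/remove_bg"


def build_remove_bg_request(image_bytes: bytes, filename: str, mime_type: str,
                            api_key: str) -> urllib.request.Request:
    """Build (but do not send) the multipart POST for the remove_bg endpoint."""
    boundary = uuid.uuid4().hex
    # One form part named "image", carrying the file name and MIME type.
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="image"; filename="{filename}"\r\n'
        f"Content-Type: {mime_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    headers = {
        "accept": "application/json",
        "APIKEY": api_key,
        "Content-Type": f"multipart/form-data; boundary={boundary}",
    }
    return urllib.request.Request(
        REMOVE_BG_URL, data=head + image_bytes + tail, headers=headers, method="POST"
    )
```

Sending would then be a call to `urllib.request.urlopen(request)`. The shape of the JSON response is not documented in this post, so the sketch leaves response handling out.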

    posted in Data_team read more
  • trunghoang12 trunghoang12

    Domain: https://agents.monkeyenglish.net/
    APIKEY: a813ec766197294184a938c331b08e7g

    Translate Text API

    Endpoint:

    POST /translate

    Description:

    This endpoint translates text from a source language to a target language. It supports both basic and advanced translation, with additional options for context, area, and style in the advanced mode.

    Request Headers:

    | Header | Type | Required | Description |
    |---|---|---|---|
    | APIKEY | String | Yes | API key for authorization |

    Request Body:

    Basic Translation:

    | Field | Type | Required | Description |
    |---|---|---|---|
    | source_lang | String | Yes | The language of the source text. |
    | target_lang | String | Yes | The language to translate to. |
    | sentence | String | Yes | The text to be translated. |
    | is_advance | Boolean | No | Set to False for basic translation. |

    Note: source_lang and target_lang can be given either as a name or as the corresponding code from the Supported Languages table below.

    Advanced Translation (with additional optional fields):

    | Field | Type | Required | Description |
    |---|---|---|---|
    | source_lang | String | Yes | The language of the source text. |
    | target_lang | String | Yes | The language to translate to. |
    | sentence | String | Yes | The text to be translated. |
    | is_advance | Boolean | Yes | Set to True for advanced translation. |
    | area | String | No | Specify the domain/area for translation (e.g., legal, medical). |
    | style | String | No | Specify the translation style (e.g., formal, informal). |
    | context | String | No | Provide additional context for the translation. |

    Example Request (Basic Translation):

    {
      "source_lang": "en",
      "target_lang": "es",
      "sentence": "Hello, how are you?",
      "is_advance": false
    }
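
The two request shapes differ only in `is_advance` and the optional fields, so one helper can build both. This is a sketch under an assumption of mine: `build_translate_payload` (a hypothetical name) switches `is_advance` to true whenever `area`, `style`, or `context` is supplied, which the API description implies but does not state.

```python
import json
from typing import Optional


def build_translate_payload(source_lang: str, target_lang: str, sentence: str,
                            area: Optional[str] = None,
                            style: Optional[str] = None,
                            context: Optional[str] = None) -> dict:
    """Return the JSON body for POST /translate (basic or advanced)."""
    payload = {
        "source_lang": source_lang,
        "target_lang": target_lang,
        "sentence": sentence,
        "is_advance": False,
    }
    # Keep only the advanced fields the caller actually provided.
    extras = {"area": area, "style": style, "context": context}
    extras = {key: value for key, value in extras.items() if value is not None}
    if extras:
        payload["is_advance"] = True
        payload.update(extras)
    return payload


basic_body = json.dumps(build_translate_payload("en", "es", "Hello, how are you?"))
```

`basic_body` reproduces the basic example request above; adding `style="formal"` yields an advanced request.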
    

    Example Response

    {
      "message": "success",
      "target": "Humanity is truly terrifying.",
      "audio_target": "",
      "data": {
        "vie": {
          "text": "Nhân loại thực sự đáng sợ.",
          "audio": "https://vnmedia2.monkeyuni.net/App/uploads/productivity/TW4TBoru1K0TAneo7qUc.wav"
        },
        "eng": {
          "text": "Humanity is truly terrifying.",
          "audio": "https://vnmedia2.monkeyuni.net/App/uploads/productivity/h41XWUypOsAOIWUvWvvW.wav"
        }
      }
    }
    

    Supported Languages:

    | code | language | script | Source | Target |
    |---|---|---|---|---|
    | afr | Afrikaans | Latn | Sp, Tx | Tx |
    | amh | Amharic | Ethi | Sp, Tx | Tx |
    | arb | Modern Standard Arabic | Arab | Sp, Tx | Sp, Tx |
    | ary | Moroccan Arabic | Arab | Sp, Tx | Tx |
    | arz | Egyptian Arabic | Arab | Sp, Tx | Tx |
    | asm | Assamese | Beng | Sp, Tx | Tx |
    | ast | Asturian | Latn | Sp | -- |
    | azj | North Azerbaijani | Latn | Sp, Tx | Tx |
    | bel | Belarusian | Cyrl | Sp, Tx | Tx |
    | ben | Bengali | Beng | Sp, Tx | Sp, Tx |
    | bos | Bosnian | Latn | Sp, Tx | Tx |
    | bul | Bulgarian | Cyrl | Sp, Tx | Tx |
    | cat | Catalan | Latn | Sp, Tx | Sp, Tx |
    | ceb | Cebuano | Latn | Sp, Tx | Tx |
    | ces | Czech | Latn | Sp, Tx | Sp, Tx |
    | ckb | Central Kurdish | Arab | Sp, Tx | Tx |
    | cmn | Mandarin Chinese | Hans | Sp, Tx | Sp, Tx |
    | cmn_Hant | Mandarin Chinese | Hant | Sp, Tx | Sp, Tx |
    | cym | Welsh | Latn | Sp, Tx | Sp, Tx |
    | dan | Danish | Latn | Sp, Tx | Sp, Tx |
    | deu | German | Latn | Sp, Tx | Sp, Tx |
    | ell | Greek | Grek | Sp, Tx | Tx |
    | eng | English | Latn | Sp, Tx | Sp, Tx |
    | est | Estonian | Latn | Sp, Tx | Sp, Tx |
    | eus | Basque | Latn | Sp, Tx | Tx |
    | fin | Finnish | Latn | Sp, Tx | Sp, Tx |
    | fra | French | Latn | Sp, Tx | Sp, Tx |
    | fuv | Nigerian Fulfulde | Latn | Sp, Tx | Tx |
    | gaz | West Central Oromo | Latn | Sp, Tx | Tx |
    | gle | Irish | Latn | Sp, Tx | Tx |
    | glg | Galician | Latn | Sp, Tx | Tx |
    | guj | Gujarati | Gujr | Sp, Tx | Tx |
    | heb | Hebrew | Hebr | Sp, Tx | Tx |
    | hin | Hindi | Deva | Sp, Tx | Sp, Tx |
    | hrv | Croatian | Latn | Sp, Tx | Tx |
    | hun | Hungarian | Latn | Sp, Tx | Tx |
    | hye | Armenian | Armn | Sp, Tx | Tx |
    | ibo | Igbo | Latn | Sp, Tx | Tx |
    | ind | Indonesian | Latn | Sp, Tx | Sp, Tx |
    | isl | Icelandic | Latn | Sp, Tx | Tx |
    | ita | Italian | Latn | Sp, Tx | Sp, Tx |
    | jav | Javanese | Latn | Sp, Tx | Tx |
    | jpn | Japanese | Jpan | Sp, Tx | Sp, Tx |
    | kam | Kamba | Latn | Sp | -- |
    | kan | Kannada | Knda | Sp, Tx | Tx |
    | kat | Georgian | Geor | Sp, Tx | Tx |
    | kaz | Kazakh | Cyrl | Sp, Tx | Tx |
    | kea | Kabuverdianu | Latn | Sp | -- |
    | khk | Halh Mongolian | Cyrl | Sp, Tx | Tx |
    | khm | Khmer | Khmr | Sp, Tx | Tx |
    | kir | Kyrgyz | Cyrl | Sp, Tx | Tx |
    | kor | Korean | Kore | Sp, Tx | Sp, Tx |
    | lao | Lao | Laoo | Sp, Tx | Tx |
    | lit | Lithuanian | Latn | Sp, Tx | Tx |
    | ltz | Luxembourgish | Latn | Sp | -- |
    | lug | Ganda | Latn | Sp, Tx | Tx |
    | luo | Luo | Latn | Sp, Tx | Tx |
    | lvs | Standard Latvian | Latn | Sp, Tx | Tx |
    | mai | Maithili | Deva | Sp, Tx | Tx |
    | mal | Malayalam | Mlym | Sp, Tx | Tx |
    | mar | Marathi | Deva | Sp, Tx | Tx |
    | mkd | Macedonian | Cyrl | Sp, Tx | Tx |
    | mlt | Maltese | Latn | Sp, Tx | Sp, Tx |
    | mni | Meitei | Beng | Sp, Tx | Tx |
    | mya | Burmese | Mymr | Sp, Tx | Tx |
    | nld | Dutch | Latn | Sp, Tx | Sp, Tx |
    | nno | Norwegian Nynorsk | Latn | Sp, Tx | Tx |
    | nob | Norwegian Bokmål | Latn | Sp, Tx | Tx |
    | npi | Nepali | Deva | Sp, Tx | Tx |
    | nya | Nyanja | Latn | Sp, Tx | Tx |
    | oci | Occitan | Latn | Sp | -- |
    | ory | Odia | Orya | Sp, Tx | Tx |
    | pan | Punjabi | Guru | Sp, Tx | Tx |
    | pbt | Southern Pashto | Arab | Sp, Tx | Tx |
    | pes | Western Persian | Arab | Sp, Tx | Sp, Tx |
    | pol | Polish | Latn | Sp, Tx | Sp, Tx |
    | por | Portuguese | Latn | Sp, Tx | Sp, Tx |
    | ron | Romanian | Latn | Sp, Tx | Sp, Tx |
    | rus | Russian | Cyrl | Sp, Tx | Sp, Tx |
    | slk | Slovak | Latn | Sp, Tx | Sp, Tx |
    | slv | Slovenian | Latn | Sp, Tx | Tx |
    | sna | Shona | Latn | Sp, Tx | Tx |
    | snd | Sindhi | Arab | Sp, Tx | Tx |
    | som | Somali | Latn | Sp, Tx | Tx |
    | spa | Spanish | Latn | Sp, Tx | Sp, Tx |
    | srp | Serbian | Cyrl | Sp, Tx | Tx |
    | swe | Swedish | Latn | Sp, Tx | Sp, Tx |
    | swh | Swahili | Latn | Sp, Tx | Sp, Tx |
    | tam | Tamil | Taml | Sp, Tx | Tx |
    | tel | Telugu | Telu | Sp, Tx | Sp, Tx |
    | tgk | Tajik | Cyrl | Sp, Tx | Tx |
    | tgl | Tagalog | Latn | Sp, Tx | Sp, Tx |
    | tha | Thai | Thai | Sp, Tx | Sp, Tx |
    | tur | Turkish | Latn | Sp, Tx | Sp, Tx |
    | ukr | Ukrainian | Cyrl | Sp, Tx | Sp, Tx |
    | urd | Urdu | Arab | Sp, Tx | Sp, Tx |
    | uzn | Northern Uzbek | Latn | Sp, Tx | Sp, Tx |
    | vie | Vietnamese | Latn | Sp, Tx | Sp, Tx |
    | xho | Xhosa | Latn | Sp | -- |
    | yor | Yoruba | Latn | Sp, Tx | Tx |
    | yue | Cantonese | Hant | Sp, Tx | Tx |
    | zlm | Colloquial Malay | Latn | Sp | -- |
    | zsm | Standard Malay | Latn | Tx | Tx |
    | zul | Zulu | Latn | Sp, Tx | Tx |

    Speech Translation API

    Endpoint:

    POST /speech/translate

    Description:

    This endpoint translates an uploaded audio file from a source language to a target language. It supports speech-to-text translation tasks.

    Request Headers:

    | Header | Type | Required | Description |
    |---|---|---|---|
    | APIKEY | String | Yes | API key for authorization |

    Request Body (Form-Data):

    | Field | Type | Required | Description |
    |---|---|---|---|
    | audio | File | Yes | The audio file to be translated. |
    | source | String | Yes | The language of the audio (e.g., en for English). |
    | target | String | Yes | The language to translate the audio to. |
    | task | String | No | Translation task type. Default is S2TT (Speech-to-Text-to-Translation). |

    Example Request (Form-Data):

    | Key | Value |
    |---|---|
    | audio | (upload audio file) |
    | source | en |
    | target | fr |
    | task | S2TT |

    Note: Language codes follow the Supported Languages table above.
    task: use S2TT to receive translated text only, or S2ST to receive translated audio together with text.
    target: accepts multiple output languages as a comma-separated list, for example "vie,eng,spa".
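
A small helper can assemble the non-file form fields and catch obvious mistakes before the upload. This is a sketch: `build_speech_form` and `VALID_TASKS` are hypothetical names, and the only validation rules are the two task values and the comma-separated `target` format described in the note above.

```python
from typing import Dict, Iterable

VALID_TASKS = {"S2TT", "S2ST"}


def build_speech_form(source: str, targets: Iterable[str], task: str = "S2TT") -> Dict[str, str]:
    """Return the text fields of the form-data body for POST /speech/translate."""
    if task not in VALID_TASKS:
        raise ValueError(f"task must be one of {sorted(VALID_TASKS)}, got {task!r}")
    cleaned = [code.strip() for code in targets if code.strip()]
    if not cleaned:
        raise ValueError("at least one target language is required")
    # Multiple outputs are sent as one comma-separated string, e.g. "vie,eng,spa".
    return {"source": source, "target": ",".join(cleaned), "task": task}
```

The `audio` file part still has to be attached by whichever HTTP client performs the upload.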

    Response:

    | Field | Type | Description |
    |---|---|---|
    | status | String | Status of the translation request. |
    | output | String | The translated text or processed output. |
    | error | String | Error message if applicable. |

    Successful Response (200 OK):

    {
      "status": "success",
      "output": "Bonjour",
      "error": ""
    }
    

    Error Response (500 Internal Server Error):

    {
      "status": "failure",
      "output": "",
      "error": "System encountered an unexpected error. <error message>"
    }
    

    Error Handling:

    • 401 Unauthorized: Invalid API key.
    • 500 Internal Server Error: System encountered an unexpected error.

    Audio Streaming Client for Speech-to-Text and Translation (S2TT)

    1. Overview

    This document provides an overview of how to implement a client for streaming audio data to a WebSocket server that processes the data for speech-to-text-to-translation (S2TT) tasks. The system is designed to handle real-time audio streaming from clients, which can be built using various programming languages.

    Key Components:

    • WebSocket Server: The server receives audio data from the client, processes it, and returns results (e.g., transcriptions, translations).
    • Client: Any client application (mobile, desktop, web) can stream audio to the server over WebSocket.
    • Streaming Protocol: Audio data is chunked and transmitted in real-time, with metadata indicating task details such as the source language, target language, and processing task.

    2. Communication Flow

    2.1 Initial Connection

    1. Client Connects to Server: The client establishes a WebSocket connection with the server at a predefined URI.

      • URI pattern: ws://<server-address>:<port>/ws/translate/<session_id>
      • Example on the hosted server: wss://agents.monkeyenglish.net/ws/translate/123
        session_id: a random string (123 in the example above)
    2. Task Metadata: The client sends an initial message to define the task. This message includes:

      • Source Language: The language of the audio input (e.g., eng for English).
      • Target Language: The language for translation (e.g., vie for Vietnamese).
      • Task Type: The processing task (e.g., S2TT for Speech-to-Text-to-Translation).

      Message Format (JSON):

      {
          "type": "start",
          "data": {
              "source": "eng",
              "target": "vie",
              "task": "S2TT"
          }
      }
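
The task metadata message is plain JSON, so building it takes a few lines. In this sketch `build_control_message` is a hypothetical helper; it also accepts the `do_task` type used at the end of a stream, since that message carries the same `data` fields.

```python
import json


def build_control_message(msg_type: str, source: str, target: str, task: str = "S2TT") -> str:
    """Serialize a "start" or "do_task" control message as a JSON string."""
    if msg_type not in ("start", "do_task"):
        raise ValueError(f"unsupported message type: {msg_type!r}")
    return json.dumps({
        "type": msg_type,
        "data": {"source": source, "target": target, "task": task},
    })


start_message = build_control_message("start", "eng", "vie")
```

`start_message` would be sent as a text frame right after the connection opens, before any audio bytes.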
      

    2.2 Streaming Audio Data

    1. Audio Streaming: The client reads and sends audio data in chunks to the server. Each chunk is a segment of the full audio file, mimicking real-time audio streaming.

      • The audio data is converted into a byte stream for transmission.
    2. Transmission Format:

      • Audio chunks are transmitted in binary format (e.g., byte array).
      • Each chunk is sent over the WebSocket connection, followed by a short delay to simulate real-time audio capture.
    3. Streaming Example:

      • For every audio chunk, the client sends the binary data over the established WebSocket connection.
      • The client continues sending chunks until the entire audio file has been transmitted.
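
Splitting the audio into chunks can be sketched with a generator. `iter_chunks` is a hypothetical name, and the sketch works on bytes already loaded in memory; a real client might read from a file or microphone buffer instead.

```python
from typing import Iterator


def iter_chunks(audio: bytes, chunk_size: int) -> Iterator[bytes]:
    """Yield consecutive slices of at most chunk_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(audio), chunk_size):
        yield audio[start:start + chunk_size]
```

Each yielded slice would be sent as one binary WebSocket frame, with the short delay described above between sends. The last chunk may be shorter than the others.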

    2.3 Task Completion

    1. End of Transmission: After the client finishes sending all audio chunks, it sends a final message to the server indicating that the streaming is complete and the task can be processed.

      Message Format (JSON):

      {
          "type": "do_task",
          "data": {
              "source": "eng",
              "target": "vie",
              "task": "S2TT"
          }
      }
      
    2. Processing Response: The server processes the received audio, performing the requested task (e.g., transcription and translation). Once complete, the server responds with the result, which may include:

      • Transcribed text.
      • Translated text.
    3. Response Format: The server sends a JSON message back to the client containing the task's result:

      {
          "message": "",
          "data": {
              "vie": {
                  "text": "thế là sáng hôm sau cái tin tôi về đến cổng còn phải thăm đường đã lan ra khóc sóng",
                  "audio": "https://vnmedia2.monkeyuni.net/App/uploads/productivity/8tihbdQvbQPHkcqmDntW.wav"
              },
              "eng": {
                  "text": "So the next morning, when I got back to the cage, I had to walk down the street to cry.",
                  "audio": "https://vnmedia2.monkeyuni.net/App/uploads/productivity/5Um6tQ1nzT3BfEqOYzx4.wav"
              }
          },
          "status": "success"
      }
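
On the client side, the result message can be reduced to a language-to-text mapping. This is a sketch based only on the fields visible in the sample response (`status`, `message`, `data.<lang>.text`); `extract_translations` is a hypothetical name, and treating any status other than "success" as an error is my assumption.

```python
import json
from typing import Dict


def extract_translations(raw_message: str) -> Dict[str, str]:
    """Map each language code in the result message to its text."""
    message = json.loads(raw_message)
    if message.get("status") != "success":
        raise RuntimeError(message.get("message") or "translation task failed")
    return {lang: entry.get("text", "") for lang, entry in message.get("data", {}).items()}
```

The `audio` URLs are ignored here; a client that plays the synthesized speech would read them from the same entries.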

    3. Client Implementation Guidelines

    3.1 Supported Languages

    The client can be developed in any language that supports WebSocket communication, such as:

    • JavaScript: Web-based applications.
    • Python: Server-side or command-line tools.
    • Java/Kotlin: Android applications.
    • Swift: iOS applications.
    • C#: Desktop or .NET applications.

    3.2 WebSocket Library

    Ensure that the client uses a WebSocket library suitable for your chosen programming language. Common libraries include:

    • JavaScript: Native WebSocket API. (socket.io speaks its own protocol on top of WebSocket, so it suits Socket.IO servers rather than a plain WebSocket endpoint.)
    • Python: websockets or websocket-client.
    • Java/Kotlin: OkHttp WebSocket implementation.
    • Swift: Starscream library for WebSocket communication.

    3.3 Audio File Handling

    The client needs to read audio files or capture audio in real time. The audio format must be compatible with the server's requirements (e.g., 16 kHz, mono, .wav).

    3.4 Chunking and Streaming

    The client should send audio data in small chunks. For real-time applications:

    • Chunk Size: Each chunk should be small enough to allow near real-time transmission, typically between 1-3 seconds of audio data per chunk.
    • Delay: Introduce a small delay (e.g., 1-10 milliseconds) between sending each chunk to simulate real-time streaming.
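
To turn a chunk duration into a byte count, multiply the number of frames by the bytes per frame. The sketch below assumes 16-bit PCM samples (2 bytes), which the post does not state; the 16 kHz mono defaults come from the audio format example above, and `chunk_size_bytes` is a hypothetical name.

```python
def chunk_size_bytes(seconds: float, sample_rate: int = 16000,
                     channels: int = 1, sample_width: int = 2) -> int:
    """Bytes of raw PCM audio covering the given duration."""
    bytes_per_frame = channels * sample_width  # one sample per channel
    frames = int(seconds * sample_rate)
    return frames * bytes_per_frame
```

Under these assumptions, one second of audio is 32,000 bytes and a three-second chunk is 96,000 bytes.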

    3.5 Error Handling

    The client must handle potential errors during the WebSocket communication:

    • Connection Issues: Reconnect if the WebSocket connection is dropped.
    • Server Responses: Handle unexpected responses or errors from the server gracefully.
    • Timeouts: Implement timeouts to prevent hanging connections if no response is received from the server.
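
One common way to handle reconnects is to wait a little longer after each failed attempt. The post does not prescribe a strategy, so the exponential backoff below is purely my suggestion; `backoff_delays` and all of its default values are hypothetical.

```python
from typing import Iterator


def backoff_delays(base: float = 0.5, factor: float = 2.0,
                   max_delay: float = 8.0, attempts: int = 5) -> Iterator[float]:
    """Yield the wait time in seconds before each reconnect attempt."""
    delay = base
    for _ in range(attempts):
        yield min(delay, max_delay)  # never wait longer than max_delay
        delay *= factor
```

A client would sleep for each yielded value before retrying the connection and give up once the generator is exhausted.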

    4. Server Configuration

    4.1 Server URI

    Clients must connect to the WebSocket server at the following URI:

    ws://<server-address>:<port>/ws/translate/<session_id>
    
    • <server-address>: IP address or domain of the WebSocket server.
    • <port>: Port on which the server is running (e.g., 5001).
    • <session_id>: A unique identifier for the client session, generated for each streaming session.
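
Filling in the URI template can be sketched as follows. `build_translate_uri` is a hypothetical helper; generating the session ID with `uuid4` is one possible way to get a unique random string, not a server requirement stated in the post.

```python
import uuid
from typing import Optional


def build_translate_uri(host: str, port: Optional[int] = None,
                        secure: bool = False, session_id: Optional[str] = None) -> str:
    """Return ws(s)://<host>[:<port>]/ws/translate/<session_id>."""
    scheme = "wss" if secure else "ws"
    authority = host if port is None else f"{host}:{port}"
    # Generate a fresh random ID when the caller does not supply one.
    sid = session_id or uuid.uuid4().hex
    return f"{scheme}://{authority}/ws/translate/{sid}"
```

Calling it with `secure=True` and no port produces URIs of the same form as the hosted wss example earlier in this post.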

    4.2 Audio Processing

    The server is responsible for:

    • Receiving and buffering audio chunks.
    • Processing the audio (speech recognition, translation).
    • Sending results back to the client in the expected format.

    5. Example Use Cases

    5.1 Mobile Voice Translation App

    A mobile app developed in Java or Swift captures the user's voice, streams the audio to the server using WebSocket, and receives the translated text, which is displayed to the user in real-time.

    5.2 Web-Based Audio Translator

    A JavaScript web application allows users to upload audio files. The app streams the audio to the server, processes it, and shows the translation results to the user.

    5.3 Desktop Speech-to-Text Tool

    A Python desktop application records audio from the microphone, streams it to the server, and displays real-time transcription and translation.


    6. Conclusion

    This document provides an overview of the WebSocket-based client-server system for real-time audio streaming and processing. The client can be implemented in any language with WebSocket support, allowing flexible integration across various platforms and applications.


    posted in Data_team read more
  • trunghoang12 trunghoang12

    Recommend

    Research and solution document for the Recommend System .docx

    posted in Data_team read more
  • trunghoang12 trunghoang12

    AI Converter

    API Tools

    • The API tools provide endpoints for these features: Sync Text and Normalize Audio (MP3)
    1. Sync Text for Audio API Documentation

    Endpoint

    URL: https://aitools.monkeyenglish.net/segement

    Method: POST

    Parameters

    | Name | Type | Description | Required | Example |
    |---|---|---|---|---|
    | audio | File (mp3) | Uploaded audio file in mp3 format | Yes | audio=@L2U6 - Chant 4 - Mr. Billy.mp3 |
    | karaoke_format | Boolean | Option to convert to karaoke format or not | Yes | true or false |
    | text | String | Text used for mapping sync text (Optional) | No | "Isn't this regular milk?" |

    Returns

    | Name | Type | Description |
    |---|---|---|
    | data | Object | Output of sync text function |
    | message | String | Information about sync text successfully or state mapping |
    | status | Boolean | True if the system completes tasks completely, else False |
    | exception | String | Empty if completed task, else contains exception message |

    Example Request

    curl -X 'POST' \
      'https://aitools.monkeyenglish.net/segement' \
      -H 'accept: application/json' \
      -H 'Content-Type: multipart/form-data' \
      -F 'audio=@L2U6 - Chant 4 - Mr. Billy.mp3;type=audio/mpeg' \
      -F 'karaoke_format=false' \
      -F 'text='
    
    2. Normalize MP3 Audio API

    Endpoint: https://aitools.monkeyenglish.net/normalize-audio

    Method: POST

    Parameters

    | Name | Type | Description | Required | Example |
    |---|---|---|---|---|
    | file | File (mp3) | Uploaded audio file in mp3 format | Yes | file=@L2U6 - Chant 4 - Mr. Billy.mp3 |

    Returns

    | Name | Type | Description |
    |---|---|---|
    | data | Object | Output of normalized audio file |
    | message | String | Information about normalization success or failure |
    | status | Boolean | True if the system completes tasks completely, else False |
    | exception | String | Empty if completed task, else contains exception message |

    Example Request

    curl -X 'POST' \
      'https://aitools.monkeyenglish.net/normalize-audio' \
      -H 'accept: application/json' \
      -H 'Content-Type: multipart/form-data' \
      -F 'file=@L2U6 - Chant 4 - Mr. Billy.mp3;type=audio/mpeg'
    

    Score Audio from CSV

    Endpoint: POST /score_audio

    Description: Processes an uploaded CSV or Excel file and returns a CSV file with processed results.

    Request:

    • Content-Type: multipart/form-data
    • file: An uploaded file in CSV or Excel format. The file should contain the data to be processed. (required)
    • name_column_text: The name of the column in the file that contains text data. (required)
    • name_column_audio: The name of the column in the file that contains audio data. (required)
    • name_column_output: The name of the column where the processed results will be written. (required)

    Convert Audio API

    Base URL

    The base URL for the API is: https://aitools.monkeyenglish.net/

    Endpoints

    1. Convert MP3 to WAV

    Endpoint: POST /mp3-to-wav/

    Description: Converts an MP3 file to a WAV file.

    Request:

    • file: The MP3 file to be converted. (required)

    Response:

    • Returns a WAV file with Content-Disposition header set to attachment; filename={original_filename}.wav.

    2. Convert WAV to MP3

    Endpoint: POST /wav-to-mp3/

    Description: Converts a WAV file to an MP3 file.

    Request:

    • file: The WAV file to be converted. (required)

    Response:

    • Returns an MP3 file with Content-Disposition header set to attachment; filename={original_filename}.mp3.

    3. Normalize MP3

    Endpoint: POST /normalize-mp3/

    Description: Normalizes the volume of an MP3 file.

    Request:

    • file: The MP3 file to be normalized. (required)

    Response:

    • Returns a normalized MP3 file with Content-Disposition header set to attachment; filename={original_filename}_normalized.mp3.

    4. Normalize WAV

    Endpoint: POST /normalize-wav/

    Description: Normalizes the volume of a WAV file and optionally changes the sample rate and number of channels.

    Request:

    • file: The WAV file to be normalized. (required)
    • sample_rate: The sample rate for the output file. (optional, default: 44100)
    • channels: The number of channels for the output file. (optional, default: 2)

    Response:

    • Returns a normalized WAV file with Content-Disposition header set to attachment; filename={original_filename}_normalized.wav.

    5. Convert WAV to Normalized MP3

    Endpoint: POST /wav-normalize-mp3/

    Description: Converts a WAV file to an MP3 file and normalizes the volume.

    Request:

    • file: The WAV file to be converted and normalized. (required)

    Response:

    • Returns a normalized MP3 file with Content-Disposition header set to attachment; filename={original_filename}_normalized.mp3.

    6. Convert MP3 to Custom WAV

    Endpoint: POST /mp3-custom-wav/

    Description: Converts an MP3 file to a WAV file with custom sample rate and number of channels.

    Request:

    • file: The MP3 file to be converted. (required)
    • sample_rate: The sample rate for the output file. (optional, default: 44100)
    • channels: The number of channels for the output file. (optional, default: 2)

    Response:

    • Returns a WAV file with Content-Disposition header set to attachment; filename={original_filename}_normalized.wav.
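
All six endpoints return the converted file as an attachment, so a client usually wants the file name from the Content-Disposition header. A minimal sketch using the standard library's header parser; `filename_from_content_disposition` is a hypothetical helper name.

```python
from email.message import Message
from typing import Optional


def filename_from_content_disposition(header_value: str) -> Optional[str]:
    """Return the filename parameter of a Content-Disposition header, if any."""
    message = Message()
    message["Content-Disposition"] = header_value
    # get_filename() handles both quoted and unquoted parameter values.
    return message.get_filename()
```

The returned name can then be used when writing the response body to disk.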

    CLI Tools

    1. CLI Usage Document for _sync_lip_sync.py
    Overview

    The _sync_lip_sync.py script is designed to synchronize lip sync data from a source directory to a destination directory. This script is executed using Python 3 and takes two arguments: the source directory and the destination directory.

    Prerequisites
    • Ensure that Python 3 is installed on your system. You can download it from python.org.
    • Make sure the _sync_lip_sync.py script is in your working directory or provide the full path to the script.
    • Ensure you have the necessary permissions to read from the source directory and write to the destination directory.
    Usage

    To run the _sync_lip_sync.py script, open your terminal or command prompt and execute the following command:

    python3 _sync_lip_sync.py <source_directory> <destination_directory>
    
    2. CLI Tool for Scoring Audio
    Overview

    This CLI tool is designed to score audio files based on transcripts provided in a CSV file. It downloads audio files from URLs specified in the CSV, processes them, and outputs the results to a specified CSV file.

    Usage

    To use the CLI tool, run the following command in your terminal:

    ```bash
    python3 _scoring_audio_competition.py arg1 arg2 arg3 arg4
    ```

    Arguments
    • arg1: Input filename (CSV)

      • The path to the input CSV file containing the transcripts and URLs.
    • arg2: Column text

      • The name of the column in the CSV file that contains the transcripts.
    • arg3: Column URL

      • The name of the column in the CSV file that contains the URLs of the audio files to be downloaded.
    • arg4: Output filename (CSV)

      • The path to the output CSV file where the scoring results will be saved.
    Detailed Instructions
    1. Prepare Input CSV File

      • Create a CSV file with at least two columns: one for the transcripts and one for the URLs of the audio files.
      • Ensure the column names match the values you will provide for arg2 and arg3.

      Example:
      ```csv
      transcript,url
      "This is a sample transcript","http://example.com/audio1.mp3"
      "Another transcript","http://example.com/audio2.mp3"
      ```

    2. Run the Tool

      • Open your terminal.
      • Navigate to the directory where _scoring_audio_competition.py is located.
      • Run the following command, replacing input.csv, transcript, url, and output.csv with your actual file and column names:

      ```bash
      python3 _scoring_audio_competition.py input.csv transcript url output.csv
      ```

    3. Output

      • The tool will download the audio files from the URLs specified in the input CSV.
      • It will process the audio files and score them based on the provided transcripts.
      • The results will be saved in the output CSV file specified.
    Example Command

    Here is an example command based on the provided instructions:

    ```bash
    python3 _scoring_audio_competition.py example_input.csv transcript url example_output.csv
    ```

    In this example:

    • example_input.csv is the input CSV file containing the transcripts and audio URLs.
    • transcript is the name of the column containing the transcripts.
    • url is the name of the column containing the audio URLs.
    • example_output.csv is the output CSV file where the scoring results will be saved.
    Notes
    • Ensure you have all necessary dependencies installed before running the script.
    • The input CSV file should be well-formatted to avoid any errors during processing.
    • The output CSV file will contain the results of the scoring process, which can be used for further analysis.
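
Since the tool depends on the column names passed as arg2 and arg3, it can help to check the input file before running it. The sketch below is not part of _scoring_audio_competition.py; `missing_columns` is a hypothetical helper that only inspects the CSV header row.

```python
import csv
import io
from typing import List, TextIO


def missing_columns(csv_file: TextIO, required: List[str]) -> List[str]:
    """Return the required column names that are absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    present = reader.fieldnames or []
    return [name for name in required if name not in present]


sample = io.StringIO('transcript,url\n"This is a sample transcript","http://example.com/audio1.mp3"\n')
problems = missing_columns(sample, ["transcript", "url"])
```

An empty list means both columns exist; for a file on disk, pass the object returned by `open(path, newline="")`.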

    posted in Data_team read more
  • trunghoang12 trunghoang12

    • Install KIC on K8S

    Kong Ingress Controller allows you to run Kong Gateway as a Kubernetes Ingress to handle inbound requests for a Kubernetes cluster.

    f6f9bd7b-86f9-45d9-a25c-832281203f6b-image.png

    Installation of KIC on K8S: [link](https://docs.konghq.com/kubernetes-ingress-controller/latest/get-started/)

    • Add a Service into KONG

    posted in Data_team read more
  • trunghoang12 trunghoang12

    Technical and integration documentation for AI M-Speak Dialogue

    1. Overview diagram of the system components

    Mspeak Dialog - Serice Dialogye.png

    • Core Dialogue: Handles the business logic of the dialogue and communicates with the other components of the system. The app connects to it through gRPC.
    • ELS: Logs and monitors metrics during processing.
    • Triton Serving: A service that runs inference for the AI models used while a dialogue is in progress.
    • OpenAI: Third party used to generate responses for the user.
    • Postgres: The main database of Dialogue.
    • Redis: Caching.
    2. Main modules in the Core Dialogue BE

    Mspeak Dialog - Page 8.png

    The Core Dialogue BE consists of 5 main modules:

    2.1 Manager

    • Contains the classes that interact with the database, performing select, update, insert, and delete operations for each table.
      -> All of them are used through factory_cls.py (Factory design pattern).
      The classes map to tables as described in the database design below.

    2.2 Modeling

    • Contains the classes that support inference with the AI models, extending triton_client -> used through factory_ai.py
    • Predefined functions for use with OpenAI.

    2.3 Pipeline:

    The dialogue has 3 main pipelines:

    • General pipeline: Every user answer passes through this pipeline.
    • Response pipeline: Depending on the case, the user is routed into 1 of 3 pipelines according to the type of question being answered: Yes/No question, Media question, or Opening question.
    • Answer pipeline: Generates the reply to the user together with the next action the user needs to take.
      Used through factory_pipeline.py

    2.4 Resource

    • A set of files including the config file and the YAML files that store the default logic and sample answers.
    • The service's proto file

    2.5 Grpc_libs:

    • Defines all input/output logic of the service when connecting to the core.
    3. Database design

    ER Diagram Description

    tbl_conversation

    | Column Name | Description |
    |---|---|
    | id | Primary key. |
    | conversation_name | Name of the conversation. |
    | description | Textual description of the conversation. |
    | level | Level of the conversation. |
    | voice | The type of voice used. |
    | type | Conversation type (format/style). |
    | num_tries | Number of tries allowed. |
    | greeting | Greeting message/phrase. |
    | greeting_media_url | URL to media file for the greeting. |
    | ending | Ending message/phrase. |
    | ending_media_url | URL to media file for the ending. |
    | bundle_path | Path to the conversation's bundled files. |
    | zip_path | Path to the zip file containing conversation data. |
    | created_at | Record creation timestamp. |
    | updated_at | Last record update timestamp. |

    tbl_question

    | Column Name | Description |
    |---|---|
    | id | Primary key. |
    | conversation_id | Foreign key (links to tbl_conversation). |
    | question | The question text. |
    | media_url | URL to media file for the question. |
    | index | The order/index of the question in conversation. |
    | attribute_extend | Additional attributes/metadata. |
    | created_at | Record creation timestamp. |
    | updated_at | Last record update timestamp. |
    | intent_condition | Conditions for intent applicability. |

    tbl_intent

    | Column Name | Description |
    |---|---|
    | id | Primary key. |
    | question_id | Foreign key (links to tbl_question). |
    | intent | The intent associated with the question. |
    | response | Response text for the intent. |
    | media_url | URL to media file for the response. |
    | retrial | Number of retry attempts allowed. |
    | created_at | Record creation timestamp. |
    | updated_at | Last record update timestamp. |

    Relationships:

    • tbl_conversation: Has a one-to-many relationship with tbl_question via conversation_id.
    • tbl_question: Has a one-to-many relationship with tbl_intent via question_id.
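The three tables and their one-to-many links can be tried out locally with a throwaway schema. This is a sketch under assumptions: SQLite is used only for illustration (the service's real database engine is not stated here), and every column type, default, and NOT NULL constraint below is a guess based on the column descriptions.

```python
import sqlite3

# Column types are assumptions; only the column names come from the tables above.
SCHEMA = """
CREATE TABLE tbl_conversation (
    id                 INTEGER PRIMARY KEY,
    conversation_name  TEXT NOT NULL,
    description        TEXT,
    level              TEXT,
    voice              TEXT,
    type               TEXT,
    num_tries          INTEGER,
    greeting           TEXT,
    greeting_media_url TEXT,
    ending             TEXT,
    ending_media_url   TEXT,
    bundle_path        TEXT,
    zip_path           TEXT,
    created_at         TEXT DEFAULT CURRENT_TIMESTAMP,
    updated_at         TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE tbl_question (
    id               INTEGER PRIMARY KEY,
    conversation_id  INTEGER NOT NULL REFERENCES tbl_conversation(id),
    question         TEXT NOT NULL,
    media_url        TEXT,
    "index"          INTEGER,  -- quoted because INDEX is an SQL keyword
    attribute_extend TEXT,
    intent_condition TEXT,
    created_at       TEXT DEFAULT CURRENT_TIMESTAMP,
    updated_at       TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE tbl_intent (
    id          INTEGER PRIMARY KEY,
    question_id INTEGER NOT NULL REFERENCES tbl_question(id),
    intent      TEXT NOT NULL,
    response    TEXT,
    media_url   TEXT,
    retrial     INTEGER,
    created_at  TEXT DEFAULT CURRENT_TIMESTAMP,
    updated_at  TEXT DEFAULT CURRENT_TIMESTAMP
);
"""


def create_database(path: str = ":memory:") -> sqlite3.Connection:
    """Create the three tables and enable foreign-key enforcement."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
    conn.executescript(SCHEMA)
    return conn
```

With foreign keys enabled, inserting a question whose conversation_id does not exist is rejected, which mirrors the relationships described above.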
    • The main functions of the service are:

    UC1: Create a new conversation
    UC2: Create a new conversation from a file (updating...)
    UC3: Get the list of conversations
    UC4: Control the conversation (Answer Communication Talk)

    To create a new conversation, prepare the conversation name, which corresponds to the topic name.
    Each conversation consists of a set of different questions and answers.
    Each question belongs to exactly one of three types: Yes/No question, Media question, Opening question

    Dialogue integration flow
    Sequence diagram.png

    Postman document: Link

    Step 0: When the user selects a topic, the introduction video is played.
    Step 1: When the intro video ends, the app sends a request with the start command.
    1fda8b2a-f3f9-421e-a91c-7ad8b447db52-image.png

    Step 2: After start, the system registers a new conversation and returns the following information:

    • session: information about the user's working session; the client mainly needs session_id
    • question: the user's current question
    • reply: the response to the user's answer. Because this is the start of the conversation and the user has not answered anything yet, it contains the opening question.

    Step 3: The user finishes answering the question.
    5d63cdcb-6368-421c-aaba-5eac6178fd5b-image.png

    7a02d5e9-e2d8-4c8e-a35e-8cd50837cd60-image.png

    Step 4: Repeat until all questions have been answered.
    If the session data contains "is_last": true, that was the last question. No further question follows, and the ending video is played.
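The client-side loop for this flow can be sketched as follows. The request payloads (`command`, `session_id`, `answer`) are hypothetical, because the real request format lives in the Postman document; only the response keys session, question, reply and the is_last flag come from the description above. The transport is passed in as a `send` callable so the sketch stays independent of the actual API.

```python
from typing import Callable, Dict, Iterable, List


def run_dialogue(send: Callable[[Dict], Dict], answers: Iterable[str]) -> List[str]:
    """Drive one conversation and return the replies received, in order.

    `send` posts one request to the dialogue service and returns the parsed
    response. The payload field names used here are assumptions.
    """
    replies: List[str] = []

    # Steps 1-2: start the conversation and remember the session.
    response = send({"command": "start"})
    session = response["session"]
    replies.append(response["reply"])  # the first reply is the opening question

    # Steps 3-4: answer until the service flags the last question.
    for answer in answers:
        if session.get("is_last"):
            break  # no further question: the app plays the ending video
        response = send({"session_id": session["session_id"], "answer": answer})
        session = response["session"]
        replies.append(response["reply"])
    return replies
```

In an app, `send` would wrap the HTTP call documented in Postman, and each reply would be played back before the next answer is recorded.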

    WebSocket Client Usage Guide for Speech-Text

    This document explains how to use a WebSocket client to send JSON data followed by audio data to a WebSocket server.

    Prerequisites

    Ensure you have the following prerequisites:

    • A WebSocket server running and accessible.
    • An audio file in WAV format (sample_rate = 16000, channels = 1)

    Steps to Use the WebSocket Client

    1. Prepare the Audio File:

      • Ensure you have an audio file in WAV format that you want to send to the WebSocket server. (sample rate 16000, channels = 1)
    2. Connect to the WebSocket Server:

      • Establish a connection to the WebSocket server using the server's URI (Uniform Resource Identifier). The URI typically includes the protocol (ws or wss), the server's address, and the endpoint for the WebSocket connection. For example:
        ws://localhost:8000/ws/{device_id}
        

    Domain dev: wss://lipsync.monkeyenglish.net/ws/{device_id}
    Domain live: wss://videocall.monkeyenglish.net/ws/{device_id}
    Note: device_id is a unique string, can use profile Id, or others

    3. Send JSON Data:

      • Send a JSON message containing the context of the speech. The JSON data should have a key named context whose value is text describing the context. For example:
        {
          "context": "This is a test context for speech-to-text conversion."
        }
        
      • Ensure the JSON message is sent before the audio data.
      • The context can be the question, the topic type, or anything else that narrows the scope of the audio input. Using the question is preferred.
    4. Send Audio Data:

      • Read the audio data from the WAV file and send it as binary data to the WebSocket server, after the JSON message.
    5. Receive the Response:

      • Wait for the WebSocket server to process the data and send a response. The response could be the result of the speech-to-text conversion or any other relevant information.
        Example response:
        {"text": "They are very beautiful"}
    6. Handle Disconnection:

      • Be prepared to handle disconnections from the WebSocket server gracefully. Ensure that any resources or connections are properly closed.

    Example Workflow

    1. Connect to the WebSocket server at ws://localhost:8000/ws/{device_id}.
    2. Send JSON data with the key context and a relevant value.
    3. Send audio data from a WAV file.
    4. Receive and process the server's response.
    5. Close the connection gracefully.

    Notes

    • Ensure that the audio data is in the correct format (e.g., float32) before sending it.
    • If an error occurs during the process, handle it appropriately and attempt to reconnect if necessary.
    • The WebSocket server should be configured to handle both JSON and binary data correctly.
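The workflow above (JSON context first, then binary audio, then one response) can be sketched in Python. This assumes the third-party `websockets` package, and it sends the PCM frames read from the WAV file as they are; whether the server expects raw frames, float32 samples, or the whole file is not stated in this guide, so adjust that part as needed. The helper names are illustrative.

```python
import json
import wave


def build_context_message(context: str) -> str:
    """JSON text message that must be sent before the audio."""
    return json.dumps({"context": context})


def read_wav_frames(path: str) -> bytes:
    """Return the PCM frames of a WAV file, checking the required format."""
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != 16000 or wav.getnchannels() != 1:
            raise ValueError("expected a 16000 Hz, 1-channel WAV file")
        return wav.readframes(wav.getnframes())


async def transcribe(uri: str, wav_path: str, context: str) -> dict:
    """Send context + audio over one WebSocket connection and return the reply."""
    import websockets  # third-party dependency: pip install websockets

    async with websockets.connect(uri) as ws:  # closed automatically on exit
        await ws.send(build_context_message(context))  # str -> text frame
        await ws.send(read_wav_frames(wav_path))       # bytes -> binary frame
        return json.loads(await ws.recv())
```

A call would look like `asyncio.run(transcribe("ws://localhost:8000/ws/device-123", "answer.wav", "What color is the sky?"))`, where the device id and file name are placeholders.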

    Detailed API docs for the APIs in the service

    AI-Module.docx

    posted in Data_team read more
  • trunghoang12 trunghoang12

    This document explains how to integrate Monkey's handwriting model into edge devices.

    Tech Stack

    • TensorFlow Lite
    • Unity

    Introduction

    Handwriting is an AI model that has been converted to the TensorFlow Lite format, which Google supports on multiple platforms.
    The model can be downloaded here

    1. Hyperparameters

    Some configurations of the model:

    Name config | Description | Value
    Input shape | Shape of the input: an image converted to width x height x channel = 128 x 128 x 3 (BGR) | 1 x 128 x 128 x 3
    Output | Score array with one score per class. The predicted label is the index of the maximum value in the array | 1 x 26

    The mapping index to the prediction character will be shown below.

    {"0": "a", "1": "b", "2": "c", "3": "d", "4": "e", "5": "f", "6": "g", "7": "h", "8": "i", "9": "j", "10": "k", "11": "l", "12": "m", "13": "n", "14": "o", "15": "p", "16": "q", "17": "r", "18": "s", "19": "t", "20": "u", "21": "v", "22": "w", "23": "x", "24": "y", "25": "z"}
    

    Example: The model predicts for an image with output:

    [[7.8083836e-03 1.0330592e-02 3.6540066e-04 1.1240702e-01 1.2986563e-01
      8.1596321e-05 2.7041843e-03 1.8760953e-02 1.5376755e-03 8.4590465e-05
      1.9241240e-02 2.4502007e-02 2.1457224e-01 1.2494331e-02 1.9096583e-02
      2.9417273e-04 2.1153286e-02 1.8904490e-02 6.2950579e-03 3.8062898e-03
      1.2752166e-01 2.5853007e-03 1.6490310e-01 3.3960843e-03 2.3415815e-03
      7.4946553e-02]]
    

    We can see that index 12 holds the maximum value (indices start from 0), with a confidence score of about 21%. Mapped to a label, it is the character 'm'.
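Decoding the score array is a plain argmax followed by a lookup in the index-to-character mapping. A minimal sketch using the example output above (the function name `decode` is illustrative):

```python
import string

# Index-to-character mapping from the table above: 0 -> "a", ..., 25 -> "z".
LABELS = string.ascii_lowercase

# Example model output copied from above (shape 1 x 26, flattened).
SCORES = [
    7.8083836e-03, 1.0330592e-02, 3.6540066e-04, 1.1240702e-01, 1.2986563e-01,
    8.1596321e-05, 2.7041843e-03, 1.8760953e-02, 1.5376755e-03, 8.4590465e-05,
    1.9241240e-02, 2.4502007e-02, 2.1457224e-01, 1.2494331e-02, 1.9096583e-02,
    2.9417273e-04, 2.1153286e-02, 1.8904490e-02, 6.2950579e-03, 3.8062898e-03,
    1.2752166e-01, 2.5853007e-03, 1.6490310e-01, 3.3960843e-03, 2.3415815e-03,
    7.4946553e-02,
]


def decode(scores):
    """Return (index, character, score) of the highest-scoring class."""
    index = max(range(len(scores)), key=scores.__getitem__)
    return index, LABELS[index], scores[index]


index, character, score = decode(SCORES)
print(index, character, f"{score:.1%}")  # -> 12 m 21.5%
```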

    3. Pre-processing for the model

    The model requires an image with a shape of 128 x 128 x 3.
    For the model to recognize characters accurately, the image should meet the following requirements:

    • Text: black
    • Background: white
    • Object: center of image
    • Size of object = 50 - 60 % of the image.
    • Resize image (128 x 128 x 3)
    • Value type: Fp32

    import cv2
    import numpy as np

    def pre_process(self, img):
        """Crop the text, center it on a white square, thicken it, and resize.

        Expects a uint8 H x W x 3 image with dark text on a white (255) background.
        self.size is the target side length (128 for this model).
        """
        # Cropping: bounding box of all non-white pixels
        # (the +1 keeps one extra pixel of margin when available).
        mask = img != 255
        mask = mask.any(2)
        mask0, mask1 = mask.any(0), mask.any(1)
        colstart, colend = mask0.argmax(), len(mask0) - mask0[::-1].argmax() + 1
        rowstart, rowend = mask1.argmax(), len(mask1) - mask1[::-1].argmax() + 1
        img = img[rowstart:rowend, colstart:colend]
        img_h, img_w = img.shape[0], img.shape[1]

        # Padding: create a white square 1.9x the longer side,
        # so the text covers roughly 50-60% of the image.
        img_size = max(img_w, img_h)
        img_size = int(1.9 * img_size)
        new_img = np.full([img_size, img_size, 3], 255, dtype=np.uint8)

        # Insert the cropped text into the center of the white image.
        start_w = (new_img.shape[1] - img_w) // 2
        start_h = (new_img.shape[0] - img_h) // 2
        new_img[start_h:start_h + img_h, start_w:start_w + img_w, :] = img

        # Thicken the text so strokes survive downscaling
        # (eroding a white background grows the dark strokes).
        iterations = int(img_size / 128)
        if iterations > 1:
            kernel = np.ones((5, 5), np.uint8)
            new_img = cv2.erode(new_img, kernel, iterations=iterations)

        # Resize to the model input size and convert to float32.
        new_img = cv2.resize(new_img, (self.size, self.size), interpolation=cv2.INTER_AREA)
        new_img = np.array(new_img, dtype=np.float32)
        return new_img

    Example input:

    img.png

    2. Installation

    3. Usage API from Server

    Live: https://app.monkeyenglish.net/mspeak/handwriting
    Dev: https://ai.monkeyenglish.net/handwriting

    Field Description
    URL
    Method POST
    Header 'APIKEY': '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'
    Body {"image": base64 string of an image, "pre_process": true, "target": ""}
    Response {
    "status": true,
    "text": [
        {
            "character": "z",
            "conf_score": 100.0
        },
        {
            "character": "o",
            "conf_score": 100.0
        },
        {
            "character": "c",
            "conf_score": 100.0
        }
    ],
    "msg": ""
    

    }

    • Note:
    • pre_process: when true, the system pre-processes the image to improve the model's performance
    • target: "a" - the text to compare with the prediction (nullable). You can pass anything
    • In the response: text is the prediction for the image, and conf_score is the confidence score (0-100%)
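A request to this endpoint can be sketched with the Python standard library. The body fields and the APIKEY header come from the table above; sending the body as JSON with a `Content-Type: application/json` header is an assumption, and the function names are illustrative. Pass your own key as `api_key`.

```python
import base64
import json
import urllib.request

DEV_URL = "https://ai.monkeyenglish.net/handwriting"


def build_request(image_bytes: bytes, api_key: str, target: str = "",
                  pre_process: bool = True, url: str = DEV_URL) -> urllib.request.Request:
    """Build the POST request; the image is sent as a base64 string."""
    body = {
        "image": base64.b64encode(image_bytes).decode("ascii"),
        "pre_process": pre_process,
        "target": target,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        # JSON content type is an assumption, not confirmed by the table above.
        headers={"APIKEY": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def recognize(image_bytes: bytes, api_key: str, **kwargs):
    """Call the API and return a list of (character, conf_score) pairs."""
    with urllib.request.urlopen(build_request(image_bytes, api_key, **kwargs)) as resp:
        result = json.load(resp)
    return [(item["character"], item["conf_score"]) for item in result["text"]]
```

For example, `recognize(open("img.png", "rb").read(), api_key)` would return the predicted characters with their scores, using the response shape shown in the table.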

    posted in Data_team read more
  • trunghoang12 trunghoang12

    Triton Serving for AI Model

    The "triton_serving" branch of this repo provides source code and guides for bringing AI models to production with TensorRT and Triton Serving.

    Install requirements

    Triton supports several platforms for bringing your AI model to production with high performance, such as PyTorch, TensorFlow, TensorRT, ONNX, and pure Python.

    If you run your code with pure Python and it needs third-party libraries, you should create a custom base image.

    Creating a base image

    I attached a Dockerfile to build a custom base image with requirements.txt.

    To build a base image, add your libraries to requirements.txt (don't forget to pin the versions), then run:

       docker build . -t <image_name>:<image_tag>
    

    Note: You can change the image name and image tag in <image_name>:<image_tag>.

    Converting model

    You can use any framework to develop your model, such as TensorFlow or PyTorch. However, running the model in its original framework is quite slow in production, so I strongly recommend converting it to another format, such as ONNX or TensorRT.
    While converting, you can enable fp16 or int8 mode to speed up inference. Remember to re-check the model's accuracy after conversion, because reduced precision can change the results.

    Two common cases:

    • ONNX with fp16 or fp32

    • TensorRT: fp16, fp32 or int8

    1. Installation

    If you want to install the TensorRT environment on your local machine, you can follow the instructions or documents

    Local installation can run into issues, most often version mismatches, so double-check the versions if it fails.
    Docker is an easier and faster alternative that avoids most installation problems. To start a TensorRT container, run the command below. It is best to map your workspace on the host machine into the Docker workspace with the -v argument.

       docker run -d --gpus all -it --rm -v "$(pwd)":/workspace nvcr.io/nvidia/tensorrt:23.09-py3
    

    NVIDIA describes the stages for converting a model from deep learning frameworks to inference frameworks:

    image

    2. Converting to ONNX

    Every model should be converted to ONNX before being converted to TensorRT. You can follow one of the guides below to convert your model.

    • TensorFlow: please use this guide

    • Torch: use this guide

    • Others: search online for a conversion guide for your framework.

    The Monkey's ONNX model was saved at S3: data-team-media-ai/model-zoo/onnx_model_zoo/

    3. Converting to TensorRT

    I used Docker to convert my model to TensorRT; you can refer to my command below:

       trtexec --onnx=models/openai/whisper-small/onnx/whisper-small-encoder.onnx --saveEngine='model.engine' --explicitBatch --workspace=1024
    
    • onnx: path to the ONNX model

    • saveEngine: output path for the TensorRT model

    • explicitBatch: builds the engine with an explicit (fixed) batch size

    • workspace: maximum workspace memory, in MB, that each layer in the model may use

    To run in fp16 or int8, add the corresponding argument to the command, for example:

       trtexec --onnx=onnx/question_statement/torch-model.onnx --saveEngine='model.engine' --explicitBatch --workspace=1024 --fp16
    

    If you want to set a dynamic axis for the TensorRT model:

    trtexec --onnx=onnx/sentence_transformer/all-mpnet-base-v2.onnx --saveEngine='model.engine' --minShapes=input_ids:1x1,attention_mask:1x1 --optShapes=input_ids:1x15,attention_mask:1x15 --maxShapes=input_ids:1x384,attention_mask:1x384
    

    You can export the engine under any file name, but to make it identifiable as a TensorRT model, use one of three file extensions: .plan, .trt, or .engine. Note that Triton only recognizes .plan files.

    Serving Triton

    After converting the model to TensorRT format, we can bring it to production through Triton Serving.

    The steps are:

    • Create a model_repository

    This folder contains all of your models.

    model_repository
    |
    +-- handwriting
        |
        +-- config.pbtxt
        +-- 1
            |
            +-- model.onnx
    
    • Define the model config inside config.pbtxt:
    name: "handwriting"
    platform: "onnxruntime_onnx"
    max_batch_size : 32
    
    input [
      {
        name: "input_1"
        data_type: TYPE_FP32
        format: FORMAT_NHWC
        dims: [ 128, 128, 3 ]
        reshape { shape: [128, 128, 3 ] }
      }
    ]
    output [
      {
        name: "dense_1"
        data_type: TYPE_FP32
        dims: [ 26 ]
        reshape { shape: [ 26] }
        label_filename: "labels.txt"
      }
    ]
    
    

    name: the model name; it must match the name of the model's folder.
    platform: the backend used to run your model [onnxruntime_onnx, tensorrt_plan, pytorch_libtorch, ...]
    max_batch_size: the maximum batch size of the model
    input: defines the inputs of the API
    output: defines the structure of the response

    A model_repository can contain many sub-folders; each sub-folder corresponds to one model.
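The folder layout shown above can be generated with a small helper instead of being created by hand. This is a sketch: `add_model` is a hypothetical name, the file name model.onnx follows the tree above, and labels.txt is written next to config.pbtxt because the config references it through label_filename.

```python
import shutil
from pathlib import Path
from typing import Iterable


def add_model(repo: Path, name: str, model_file: Path, config_text: str,
              labels: Iterable[str], version: int = 1) -> Path:
    """Create <repo>/<name>/{config.pbtxt, labels.txt, <version>/model.onnx}."""
    model_dir = repo / name
    version_dir = model_dir / str(version)
    version_dir.mkdir(parents=True, exist_ok=True)

    # The model file goes into the numbered version folder.
    shutil.copy(model_file, version_dir / "model.onnx")

    # The config and label files live at the model's top level.
    (model_dir / "config.pbtxt").write_text(config_text)
    (model_dir / "labels.txt").write_text("\n".join(labels) + "\n")
    return model_dir
```

For the handwriting model, `labels` would be the 26 characters "a" to "z", one per line, and `config_text` the config.pbtxt content shown above.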

    After converting the model, don't forget to upload it to S3:

    aws  s3  sync  model_repository   s3://data-team-media-ai/model-zoo/triton/
    
    • Serving
    docker run -d --gpus all  --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p8000:8000 -p8001:8001 -p8002:8002 -v/home/mautrung/edu_ai_triton_serving/model_repository:/models nvcr.io/nvidia/tritonserver:23.09-py3-custom tritonserver --model-repository=/models
    

    Note: Some requirements about the device to run Triton

    Testing

    Triton exposes two protocols: gRPC (port 8001) and HTTP (port 8000).
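Over HTTP, Triton accepts inference requests at /v2/models/<model_name>/infer with a JSON body listing the input tensors. A minimal sketch for the handwriting config above, using only the standard library; the input name input_1, output name dense_1, and the 128 x 128 x 3 FP32 shape come from that config, while the localhost URL and function names are assumptions.

```python
import json
import urllib.request

INPUT_SIZE = 128 * 128 * 3  # values per image, from dims: [ 128, 128, 3 ]


def build_infer_body(pixels, batch: int = 1) -> dict:
    """Build the JSON body for one inference request (flattened FP32 data)."""
    pixels = list(pixels)
    if len(pixels) != batch * INPUT_SIZE:
        raise ValueError(f"expected {batch * INPUT_SIZE} values, got {len(pixels)}")
    return {
        "inputs": [{
            "name": "input_1",
            "shape": [batch, 128, 128, 3],
            "datatype": "FP32",
            "data": pixels,
        }],
        "outputs": [{"name": "dense_1"}],
    }


def infer(pixels, url: str = "http://localhost:8000/v2/models/handwriting/infer"):
    """POST the request and return the flattened scores of the first output."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_infer_body(pixels)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["outputs"][0]["data"]
```

The same JSON body, saved to a file, can serve as the request body for a load test.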

    Postman documents

    Benchmark API from Triton

    We can benchmark a model served by Triton with the Apache Benchmark (ab) tool.

    ab -p data_samples/body_bm_wav2vec/bm_w2v.txt -T application/json  -c 100 -n 1000 http://localhost:8000/v2/models/wav2vec_trt/infer
    
    • data_samples/body_bm_wav2vec/bm_w2v.txt: the file containing the request body. It is in JSON format but saved as a .txt file.
    • -c: number of concurrent requests.
    • -n: total number of requests to send.

    Some sample data in a folder: data_samples

    Result:

    Result will be show

    posted in Data_team read more
  • trunghoang12 trunghoang12

    Back-end documentation for the AI testing web

    In this project, the BE provides the APIs that handle the logic flows for the AI web at link:
    The BE is responsible for:

    • Storing and updating participant information and participant states
    • Proposing questions and handling the exam logic for each level
    • Connecting to Mspeak Service for scoring
    • Connecting to Kinesis to synchronize activity data with the Report Service flow
      Kinesis -> Lambda -> DynamoDB

    Configuration details for the Report Service:

    • Kinesis: dev: ai_report, live: ai_report_production
    • Lambda Function: dev: build_ai_testing_report , live: ai_webtest_production
    • DynamoDB Table: dev: ai_testing_report , live: ai_testing_report_production
    • Database name: dev: edu_platform_6, live: edu_platform

    The project's technical documents are listed below:

    • Database design document: here
    • API description document: here
    • Workflow document: [here](link url)
    • GitHub code: here
    • Explanation of the tables

    posted in Data_team read more