“IT’S ALIVE! IT’S ALIVE!!”

“Unable to endure the aspect of the being I had created, I rushed out of the room and continued a long time traversing my bed-chamber, unable to compose my mind to sleep.” - Frankenstein by Mary Shelley

Alright, so it wasn’t quite that dramatic, but the sentiment still stands. Last night was the event horizon, except I need eyes where I’m going. Damn, that’s a good film; go watch it if you haven’t seen it before. The breach point was confirmed shortly after midnight when the gameplay loop ran fully WITH the voice output and the avatar movement. It was surreal to see it all finally click together after debugging tracebacks and rewriting the LLM logic.

What’s New?

Voice Pipeline

Since the previous devlog entry I have integrated the voice pipeline and the avatar. The responses coming out of the LLM are no longer just returned to the action logging and execution functions within the qud_gameplay module; they are also routed straight into the index-tts loop, where they are modulated and played back as audio via ffplay. I added a function to the indextts class that uses the existing sentence splitting to pass the audio out in generated chunks instead of waiting for the full inference loop to complete and the whole response audio to be generated. Modulation currently uses a 7-second sample voice, so there will always be that 7 seconds of latency while it maps the waveforms against the sample, but chunking the output has the potential to reduce the perceived latency of a response dramatically. The benefit compounds as the response grows, so it’s really valuable to have in place for a more rapid response that emulates speaking while thinking.
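
To make that concrete, here’s a minimal sketch of the chunked flow, in the spirit of stream_infer(). It’s illustrative only: split_sentences() is a naive stand-in for index-tts’s own splitting, and tts_model.infer() is an assumed wrapper that returns WAV bytes, not the real index-tts call signature.

```python
# Illustrative sketch only: the splitter and the tts_model.infer() call
# are stand-ins, not the actual index-tts API.
import re
import subprocess

def split_sentences(text: str) -> list[str]:
    # Break on sentence-ending punctuation followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def play_wav_bytes(wav_bytes: bytes) -> None:
    # Pipe the WAV data straight into ffplay; no temp files, no window.
    subprocess.run(
        ["ffplay", "-autoexit", "-nodisp", "-loglevel", "quiet", "-"],
        input=wav_bytes,
        check=True,
    )

def stream_infer(tts_model, voice_sample: str, response_text: str) -> None:
    # One inference call per sentence: the first chunk is audible while
    # the rest of the response is still being synthesised.
    for sentence in split_sentences(response_text):
        wav_bytes = tts_model.infer(voice_sample, sentence)  # assumed API
        play_wav_bytes(wav_bytes)
```

The sketch plays each chunk before synthesising the next; to squeeze out the full win you’d overlap the two with a thread or queue so playback and generation run side by side.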

Fixing that voice latency was a big win for response times, but it does nothing about the time the LLM spends generating the response. Well, we can apply the same process to the LLMRunner: another benefit of going direct with something like llama.cpp rather than Ollama, in my particular circumstance, is the control to make exactly these kinds of changes. The LLMRunner now creates an async subprocess that pipes stdout and stderr and reads them at a set buffer size, and that output can then be pulled into both the voice pipeline and executor.py. It mirrors the chunked voice output: buffered reads streamed progressively to reduce wait time, which is especially beneficial with larger responses.
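
A rough sketch of that streaming read, under some assumptions: the binary name, model path and invocation below are placeholders rather than my exact llama.cpp command line, and chunk_size stands in for the real buffer size.

```python
# Rough sketch of the streaming read in LLMRunner; invocation is a placeholder.
import asyncio

class LLMRunner:
    def __init__(self, binary: str = "./llama-run", model_path: str = "model.gguf"):
        self.binary = binary
        self.model_path = model_path

    async def stream(self, prompt: str, chunk_size: int = 256):
        # Placeholder invocation; adjust to whatever your llama.cpp build expects.
        proc = await asyncio.create_subprocess_exec(
            self.binary, self.model_path, prompt,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.DEVNULL,  # the real runner pipes this too
        )
        # Yield decoded chunks as soon as they arrive so the voice pipeline
        # and executor.py can start work before generation has finished.
        while True:
            chunk = await proc.stdout.read(chunk_size)
            if not chunk:
                break
            yield chunk.decode(errors="ignore")
        await proc.wait()
```

The consumers then just iterate it, e.g. `async for piece in runner.stream(prompt): ...`, acting on each piece as it lands.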

There may well be more elegant ways of doing what I have described above, and the biggest saving would probably come from swapping the current quantised 13B-parameter model for something friendlier on my GPU. I’ll wait and see how it feels over extended play first.

Avatar Integration

I have also now got the avatar integration set up so that the audio playback mentioned above triggers mouth movement on a Live2D model within VTube Studio. Normally this software would use a webcam or a mobile phone camera to track the user’s face and map it onto the model rig. There’s a snap below showing how it looks within Live2D, and there’s a wealth of information in the Live2D Docs.

The cupboards were bare

You’re essentially mapping the model’s mesh vectors to your facial motion vectors, so when you raise your eyebrows the model raises its eyebrows. This project, though, is using an AI, so there is no face to map (no, I’m not sitting there in silence controlling the face), so how does the mouth move? VTube Studio allows for plugins and custom parameters, so once I authorised the plugin in VTube Studio it let the mouth movements be synced via its websocket API. The same principle would apply to other expressions or natural body movements, such as the body moving along the y axis with breathing … which is probably the strangest way I have ever thought about breathing. If you want to see the rigging process in action then I would recommend this BriAtCookieBox video; she rigs her own model and, to put it bluntly, is shit-hot at it.
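
For anyone curious what “synced via a websocket API” amounts to, here’s a hedged sketch of the kind of request the client sends. It assumes VTube Studio’s default local websocket port and an already-authorised plugin (the token handshake is left out), and it drives the standard MouthOpen input parameter.

```python
# Hedged sketch of a VTube Studio parameter injection; assumes the plugin
# is already authorised, so the auth handshake is omitted for brevity.
import asyncio
import json
import websockets  # pip install websockets

async def set_mouth_open(value: float, uri: str = "ws://localhost:8001") -> None:
    payload = {
        "apiName": "VTubeStudioPublicAPI",
        "apiVersion": "1.0",
        "requestID": "mouth-sync",
        "messageType": "InjectParameterDataRequest",
        "data": {"parameterValues": [{"id": "MouthOpen", "value": value}]},
    }
    async with websockets.connect(uri) as ws:
        await ws.send(json.dumps(payload))
        await ws.recv()  # the API answers every request

# e.g. asyncio.run(set_mouth_open(0.8)) opens the mouth most of the way
```

In practice you’d keep the websocket open and push values continuously while the audio plays, rather than reconnecting for every update.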

I should hasten to add that I have not taken up Live2D myself and designed and rigged a model. All I have done is buy the model files, and a non-exclusive right to use them, from Aka Huhu of GALAXY project.

There’s still fine-tuning required to simulate phoneme movement more naturally during playback … plus a few little extras I’m considering. All of that is for another day though. With these additions, how does the flowchart look now???

```mermaid
graph TD
    subgraph Game_Caves_of_Qud
        qud_engine["Qud Engine"]
        log_output["message_log.txt"]
        status_json["status JSONs: hp, status, zone, inventory"]
    end

    subgraph AI_Runtime
        watcher["perception.py - Log Watcher"]
        buffer["event_buffer - context memory"]
        prompt_builder["prompt_builder.py - build_prompts()"]
        runner["llama_runner.py - LLMRunner"]
        action_parser["action_parser.py - parse_action()"]
        inventory_parser["inventory_parser.py - parse_inventory_action()"]
        executor["executor.py - decide_and_act()"]
        tts["infer.py - stream_infer()"]
        avatar_sync["vtube_sync.py - VTubeStudioClient"]
    end

    subgraph LLM_Model
        model["llama-run: *chosen model*"]
    end

    qud_engine -->|Appends events| log_output
    qud_engine -->|Exports state| status_json
    log_output --> watcher
    status_json --> buffer
    watcher --> buffer
    buffer --> prompt_builder
    prompt_builder --> runner
    runner --> model
    model --> runner
    runner --> action_parser
    runner --> inventory_parser
    action_parser --> executor
    inventory_parser --> executor
    executor -->|Inject command| qud_engine

    %% New additions
    runner -->|LLM output| tts
    tts -->|Audio stream| avatar_sync
    tts -->|Play via ffplay| qud_audio["Audio Output"]
```

It’s Tricky Tricky Tricky Trrrrrricky

It was not smooth sailing though, and I had expected as much. I was effectively trying to tie together four different elements:

- Gameplay loop
- LLM function
- Voice pipeline
- Avatar integration

In my experience this is not a scenario where AI-assisted coding shines, as the LLM isn’t “remembering” the content of files that have already been generated or code that has been implemented and is working. Realistically it could be made a more pleasant experience by first mapping out the exact requirements for each integration step, then crafting a prompt that provides the existing code as context so the focus stays on what is already in place … I wasn’t prepared to put in that level of effort on this occasion, which in hindsight was pretty dumb.

To give an example of the impact this had: there is a specific file, prompt_builder.py, which handles building the prompt for the LLM by consolidating log entries into a context block and then wrapping them with a system_prompt, instruction_prompt and user_prompt. It’s a masterful piece of engineering really … (it’s not, it’s a simple thing, but it keeps functions more isolated and I love isolation) … Well, prompt_builder.py wasn’t mentioned clearly enough when I started chopping up the LLM responses and chunking the outputs to the voice; I got to the end to test and realised that my beautifully crafted prompt, which had been generating some good responses, was now throwing out a stream of syntax which, while sounding lovely, was not optimal. The changes had bypassed prompt_builder.py because the instruction was just to solve a problem, without explicitly giving clear architecture constraints or context.
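
For a sense of what got bypassed, build_prompts() does something along these lines. This is a simplified sketch rather than the actual file, and the prompt strings are placeholders.

```python
# Simplified sketch of the build_prompts() idea: fold recent events into a
# context block and wrap them with the three prompt layers. Prompt text here
# is placeholder, not the real wording.
SYSTEM_PROMPT = "You are an adventurer in Caves of Qud."              # placeholder
INSTRUCTION_PROMPT = "Reply with a short narration and one action."   # placeholder

def build_prompts(event_buffer: list[str], player_status: dict) -> str:
    context = "\n".join(event_buffer[-20:])  # most recent events only
    user_prompt = (
        f"Recent events:\n{context}\n\n"
        f"Current status: {player_status}\n"
        "What do you do next?"
    )
    return f"{SYSTEM_PROMPT}\n\n{INSTRUCTION_PROMPT}\n\n{user_prompt}"
```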

I got it working, so it was a midnight annoyance more than anything, but it was still an annoyance I probably didn’t need to deal with and chose to anyway. If anything it was a test of my growing ability to break something, debug it and attempt a fix before resorting to mindless AI reliance. I don’t believe reliance on the tool should ever be the end goal with the vibes; it’s a really good Swiss army knife, but you don’t reach for one of those when you’re facing down a ferocious bear in the woods. This implementation phase was a good reminder that it’s easy to get carried away with the vibe sometimes, but fundamentally everything should be backed by clarity of purpose and a solid understanding of the architecture.

#qudlocked

What’s next then? There’s some fine-tuning to do before we can truly be #qudlocked, but the majority of the implementation is complete for phase 1.

Stay tuned.

piestyx