https://civitai.com/models/827184?modelVersionId=1761560

! 2022-11-01 https://www.instagram.com
www.instagram.com##.x1uhb9sk

! Title: Hide YouTube Shorts
! Description: Hide all traces of YouTube shorts videos on YouTube
! Version: 1.8.0
! Last modified: 2023-01-08 20:02
! Expires: 2 weeks (update frequency)
! Homepage: https://github.com/gijsdev/ublock-hide-yt-shorts
! License: https://github.com/gijsdev/ublock-hide-yt-shorts/blob/master/LICENSE.md

! Hide all videos containing the phrase "#shorts"
youtube.com##ytd-grid-video-renderer:has(#video-title:has-text(#shorts))
youtube.com##ytd-grid-video-renderer:has(#video-title:has-text(#Shorts))
youtube.com##ytd-grid-video-renderer:has(#video-title:has-text(#short))
youtube.com##ytd-grid-video-renderer:has(#video-title:has-text(#Short))

! Hide all videos with the shorts indicator on the thumbnail
youtube.com##ytd-grid-video-renderer:has([overlay-style="SHORTS"])
youtube.com##ytd-rich-item-renderer:has([overlay-style="SHORTS"])
youtube.com##ytd-video-renderer:has([overlay-style="SHORTS"])
youtube.com##ytd-item-section-renderer.ytd-section-list-renderer[page-subtype="subscriptions"]:has(ytd-video-renderer:has([overlay-style="SHORTS"]))

! Hide shorts button in sidebar
youtube.com##ytd-guide-entry-renderer:has-text(Shorts)
youtube.com##ytd-mini-guide-entry-renderer:has-text(Shorts)

! Hide shorts section on homepage
youtube.com##ytd-rich-section-renderer:has(#rich-shelf-header:has-text(Shorts))
youtube.com##ytd-reel-shelf-renderer:has(.ytd-reel-shelf-renderer:has-text(Shorts))

! Hide shorts tab on channel pages
! Old style
youtube.com##tp-yt-paper-tab:has(.tp-yt-paper-tab:has-text(Shorts))
! New style (2023-10)
youtube.com##yt-tab-shape:has-text(/^Shorts$/)

! Hide shorts in video descriptions
youtube.com##ytd-reel-shelf-renderer.ytd-structured-description-content-renderer:has-text("Shorts remixing this video")
! Remove empty spaces in grid
youtube.com##ytd-rich-grid-row,#contents.ytd-rich-grid-row:style(display: contents !important)

englishtorrent.*###overlay

! 28-Sept-2024 https://nanoreview.net
nanoreview.net##.banner

||www.reddit.com/r/popular/^$document
! ||www.reddit.com/?feed=popular^$document
feed=popular*$all


Are there any end to end voice models, which contain STT, an LLM, and a TTS together?

Yes, there are end-to-end (E2E) voice models that integrate speech-to-text (STT), reasoning (LLM), and text-to-speech (TTS) into a single neural network. These are often referred to as speech-to-speech (S2S) or native multimodal models.

Leading End-to-End Voice Models

Unlike traditional "cascaded" systems that chain three separate models together, these native models process audio input directly into audio output.

- GPT-4o (OpenAI): A single model trained end-to-end across text, vision, and audio. It can respond to audio inputs in as little as 232 ms, preserving emotional nuance and prosody that are typically lost in cascaded systems.
- Qwen3-TTS/ASR (Alibaba): A native multimodal family using a "Thinker-Talker" architecture. It is optimized for low-latency, edge-first voice interaction, handling both reasoning and speech synthesis on-device.
- Ultravox: An open-weight model that replaces the separate STT/ASR stage by coupling an audio encoder directly to an LLM, letting the model "reason" on audio tokens rather than transcribed text.
- Moshi (Kyutai): An experimental E2E model capable of real-time, low-latency spoken conversation.

Comparison: End-to-End vs. Cascaded Pipelines

| Feature | End-to-End (Native S2S) | Cascaded (STT → LLM → TTS) |
| --- | --- | --- |
| Architecture | Single neural network | Multiple independent models |
| Latency | Ultra-low (approx. 200–300 ms) | Medium to high (sequential processing) |
| Prosody/Emotion | Preserved from original input | Often lost during text conversion |
| Control | Harder to debug (black box) | Highly customizable at each stage |
| Reasoning | Improving, but can be "dumber" | Superior for complex logic/tools |

Emerging Hybrid Solutions

Some platforms now offer "unified" infrastructures that co-locate separate STT, LLM, and TTS models in a single cloud region to mimic E2E performance. For example, Together AI provides a co-located stack (using Deepgram and Cartesia) that eliminates network hops and achieves sub-700 ms latency without being a single "black box" model.

Are any of these models open weight/source?

Several of the end-to-end (E2E) voice models mentioned are open weight, meaning you can download the model parameters and run them on your own hardware. However, their "openness" varies between fully open-source and research-only licenses.

Open-Weight End-to-End Models

These models combine listening (ASR/STT) and talking (TTS) in a single multimodal framework.

- Ultravox (Fixie.ai): Fully open weight and one of the most active projects. It couples an audio encoder directly to a language model (such as Llama 3.3 or GLM-4). The weights for various versions (8B to 70B+) are available on Hugging Face.
- Moshi (Kyutai): A native speech-to-speech model released with open weights. It uses a 7B "Helium" backbone and the "Mimi" codec to handle audio and text streams simultaneously. Its weights are available under a CC BY 4.0 license.
- Qwen2-Audio / Qwen3-TTS (Alibaba): Alibaba frequently releases open-weight versions of its multimodal models. Qwen2-Audio is a large-scale audio-language model that can "listen" and reason, while newer releases such as Qwen3-TTS allow expressive voice design and are available under the Apache 2.0 license.
- GLM-4-Voice (Zhipu AI): An end-to-end voice model that supports real-time, emotional human-computer interaction. It is often integrated into open-weight pipelines such as Ultravox v0.7.

Closed-Source Models (Proprietary)

- GPT-4o (OpenAI): Proprietary. You can only access its end-to-end voice capabilities through the OpenAI API or the ChatGPT app. While OpenAI has released some open-weight models (like GPT-OSS), its flagship multimodal E2E voice tech remains closed.

Summary Table: Open vs. Closed

| Model | Open Weights? | License Type | Primary Use Case |
| --- | --- | --- | --- |
| Ultravox | Yes | Apache 2.0 / MIT | Real-time enterprise voice agents |
| Moshi | Yes | CC BY 4.0 | Low-latency research & interaction |
| Qwen3-TTS | Yes | Apache 2.0 | Multilingual voice cloning & design |
| GPT-4o | No | Proprietary (API) | General-purpose high-intelligence AI |
| GLM-4-Voice | Yes | Research-friendly | Emotional, conversational reasoning |

Are there any projects that can run with 12GB vram and 32 GB system RAM?

Yes, with 12GB VRAM and 32GB system RAM you can run several capable end-to-end (speech-to-speech) models locally, provided you use quantized versions (compressed weights). Here are the specific projects and configurations that fit your hardware:

1. GLM-4-Voice-9B (Int4 Quantized)

This is likely your best bet for a high-quality, "smart" conversationalist that fits comfortably.

- Hardware fit: The INT4 quantized version is explicitly optimized to run on 12GB VRAM GPUs (like the RTX 3060/4070).
- What it does: It handles understanding, reasoning, and speech generation in one model. It can follow instructions to change its tone, speed, or emotion.
- Where to get it: Look for the GLM-4-Voice-9B-Int4 version on Hugging Face.

2. Moshi (quantized)

Moshi is a native speech-to-speech model designed for low latency, built on a 7B backbone.

- Hardware fit: The full FP16 model requires ~14–16GB VRAM, but 4-bit (Q4) or 8-bit (Q8) quantized versions fit easily into 12GB.
  - 4-bit: uses ~5–6 GB VRAM (leaving plenty of room for long context).
  - 8-bit: uses ~8–9 GB VRAM (a tighter fit, but doable).
- What it does: It listens and speaks simultaneously (full duplex), handling interruptions and emotional speech very well.
- Where to get it: the Moshi Hugging Face collection. You will likely need a backend such as llama.cpp or a specialized loader that supports quantization for multimodal models.

3. Ultravox (v0.4 / v0.5)

Ultravox is primarily an "audio-LLM" (audio in, text out) that pairs with a TTS, but newer experimental versions and pipelines are pushing toward end-to-end capabilities.

- Hardware fit: The models are typically based on Llama-3-8B or Mistral-7B.
- 8B model (4-bit): fits comfortably in 12GB VRAM.
- Note: Unlike Moshi or GLM-4-Voice, standard Ultravox implementations focus heavily on the input side (understanding audio directly) and may still rely on a lightweight separate TTS for the output in some setups, though the experience feels nearly instantaneous.

4. Qwen2-Audio-7B (Audio-In Only) + Qwen3-TTS

While not a single "fused" model, this pair is optimized to run together on consumer hardware.

- Configuration:
  - Qwen2-Audio-7B (Int4): ~5GB VRAM.
  - Qwen3-TTS: ~2GB VRAM.
  - Total: ~7GB VRAM (fits easily within your 12GB limit).
- Why use this: It separates "thinking/listening" from "speaking," which gives you higher-quality voice control (using Qwen3-TTS's voice cloning) compared to the occasionally "robotic" artifacts of some fully fused models.

Summary Recommendation

- For the "real" end-to-end experience: download GLM-4-Voice-9B (Int4). It is the most cohesive "single brain" experience that fits your card.
- For low-latency chat: try Moshi (quantized). It feels the most "alive" due to its duplex nature.

In their repo: "While we hope that the present codebase will work on Windows, we do not provide official support for it. We have tested the MLX version on a MacBook Pro M3. At the moment, we do not support quantization for the PyTorch version, so you will need a GPU with a significant amount of memory (24GB)." But kyutai/moshiko-pytorch-q8 does exist. What does this mean?

It means there is a gap between the project's official documentation and what the community (or the devs themselves) has actually released since the initial launch. Here is why that contradiction exists and what it means for you:

1. The README is likely outdated

When Moshi first launched, only "full precision" (BF16) was supported for the PyTorch version, which requires ~24GB VRAM. The README was written at that time. Since then, the model has also been quantized into the Q8 (8-bit) format you found.

2. "Official support" vs. "it exists"

The disclaimer about Windows and 24GB VRAM is a "cover our backs" statement.

- Official support: If you hit an issue on Windows or with a quantized model, the developers might not help you troubleshoot it.
- The reality: The moshiko-pytorch-q8 model was created specifically to lower the VRAM requirement. 8-bit quantization typically cuts VRAM usage roughly in half compared to BF16.

3. Will it work for you?

Yes, but with caveats:

- VRAM: The Q8 version of Moshi (7B parameters) should use roughly 8GB to 10GB of VRAM. That fits into your 12GB card, but it's tight: you won't have much room for other GPU-heavy apps (a web browser with hardware acceleration, a game) running in the background.
- Windows: Since Windows isn't officially supported, you might run into issues with specific dependencies (like uvloop or certain audio drivers). Pro tip: use WSL2 (Windows Subsystem for Linux). It lets you run the Linux version of the code on your Windows machine while still using your GPU, and it is much more stable for these kinds of AI projects.

The Bottom Line

Ignore the "24GB" warning in the README if you are using the Q8 version; that warning applies to the uncompressed model. With your 12GB VRAM, the Q8 version is exactly what you should be trying.
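The VRAM figures above follow from a simple rule of thumb: weight memory ≈ parameter count × bytes per parameter, plus headroom for the KV cache and activations. A minimal sketch of that arithmetic; the 20% overhead factor is an assumption for illustration, not a measured number:

```python
def estimate_vram_gb(params_billions: float, bits_per_param: int,
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weight size plus ~20% for cache/activations."""
    weight_gb = params_billions * bits_per_param / 8  # 1B params at 8 bits = 1 GB
    return round(weight_gb * overhead, 1)

# Moshi's 7B backbone at different precisions:
print(estimate_vram_gb(7, 16))  # → 16.8  (BF16, needs a 24GB-class card)
print(estimate_vram_gb(7, 8))   # → 8.4   (Q8, fits a 12GB card)
print(estimate_vram_gb(7, 4))   # → 4.2   (Q4, plenty of headroom)
```

The estimates land inside the ranges quoted above (~8–10GB for Q8, ~5GB for Q4); real usage varies with context length and backend.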
If Q8 still feels sluggish or hits your VRAM limit, look for a Q4 version (which would only use ~5GB VRAM), though audio quality/reasoning might drop slightly.

Could this Qwen3-TTS run on the NPU in Ultra 9 275HX?

Yes, Qwen3-TTS can run on the NPU of the Intel Core Ultra 9 275HX using the OpenVINO toolkit. Intel provides "Day 0" support for the Qwen3 family, including specific optimizations for the NPU and integrated Xe GPU to handle high-speed audio generation.

Running Qwen3-TTS on Your Hardware

The Core Ultra 9 275HX (Arrow Lake) features an Intel AI Boost NPU capable of roughly 13 TOPS. For Qwen3-TTS specifically:

- Model compatibility: Both the 0.6B and 1.7B parameter versions of Qwen3-TTS are designed for "edge devices" like yours.
- Performance: Using the NPU with OpenVINO allows ultra-low-latency streaming, with first-audio-packet delivery as fast as 97ms.
- Resource efficiency: With the NPU handling inference, your dedicated GPU (e.g., the RTX 50-series often paired with this CPU) stays free for other tasks like gaming or rendering.

How to Deploy

To run it on the NPU, you typically use the OpenVINO Notebooks implementation:

1. Convert: Export the model components (Talker, Embedding, Code Predictor) to OpenVINO's .xml IR format.
2. Select device: In your inference script or the OpenVINO pipeline, set the target device to NPU.
3. Quantize: While 8-bit quantization is standard for Qwen3, OpenVINO also supports INT4 or INT8 optimizations to further reduce the load on the NPU.

If you prefer a simpler setup, tools like Pinokio can automate the installation, though they often default to the GPU for faster initial setup.
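The "select device" step can be sketched with OpenVINO's Python API. `pick_device` is a small helper written here (not an OpenVINO function), and the IR file name is a placeholder for whatever the conversion step actually produced:

```python
def pick_device(available, preferred=("NPU", "GPU", "CPU")):
    """Return the first preferred OpenVINO inference device that is present."""
    for dev in preferred:
        if dev in available:
            return dev
    raise RuntimeError("no OpenVINO inference device found")

# With OpenVINO installed (pip install openvino) this becomes:
#   import openvino as ov
#   core = ov.Core()
#   device = pick_device(core.available_devices)  # "NPU" on a Core Ultra with drivers loaded
#   compiled = core.compile_model("qwen3_tts_talker.xml", device)  # placeholder IR path
```

Falling back through GPU to CPU keeps the script usable on machines where the NPU driver is missing.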
While that's running there can I run Qwen2-Audio-7B (Int4) on the 5070 Ti? Or is it better to run both on the same GPU?

Yes, you can run Qwen2-Audio-7B (Int4) on your RTX 5070 Ti while Qwen3-TTS runs on your NPU. In fact, separating them is the better strategy for performance.

Why Split the Workload?

- No resource contention: If you put both on the GPU, they would compete for VRAM and compute cycles. Your RTX 5070 Ti has 16GB VRAM, which is plenty, but running them simultaneously on one device could cause "hiccups" in the audio stream if the GPU gets busy processing a complex user prompt.
- Parallel processing: The NPU is essentially "free real estate" for the TTS. It can generate speech while the GPU is already starting on the next part of the conversation or processing the next audio input.
- Lower latency: By offloading the TTS to the NPU (which is optimized for streaming small packets), you avoid the overhead of moving data back and forth to the GPU for every tiny audio chunk.

Proposed Configuration

This "hybrid" setup plays to each piece of hardware's strengths:

| Component | Model | Device | VRAM/RAM Usage | Role |
| --- | --- | --- | --- | --- |
| Listening & thinking | Qwen2-Audio-7B (Int4) | RTX 5070 Ti (GPU) | ~5–6 GB VRAM | Listens to your voice, understands context, and generates the text response. |
| Speaking | Qwen3-TTS | Intel NPU | Shared system RAM | Converts the text response from the GPU into audio instantly. |

Feasibility Check

- GPU headroom: The RTX 5070 Ti has 16GB VRAM, and the Qwen2-Audio-7B (Int4) model only needs about 5–6GB, leaving ~10GB of VRAM for other tasks (gaming, browser, OS) without any slowdowns.
- NPU capability: The Core Ultra 9's NPU is specifically validated by Intel to run Qwen3-TTS efficiently.

How to Set This Up (Conceptual)

You will likely need a small "orchestrator" script (Python) to glue them together:

1. Input: Microphone audio goes to the GPU (Qwen2-Audio).
2. Process: The GPU generates text tokens (e.g., "Hello there!").
3. Bridge: As soon as text tokens appear, stream them to the NPU.
4. Output: The NPU (via OpenVINO) synthesizes the audio and plays it out.

This setup essentially mimics the architecture of a professional cloud voice agent, but runs entirely offline on your desk.
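The four steps above can be sketched as a minimal orchestrator skeleton. Both model calls are stubs: a real version would run Qwen2-Audio on the GPU inside `listen_and_think` and drive Qwen3-TTS via OpenVINO inside `speak`:

```python
def listen_and_think(audio_chunk: bytes):
    """GPU side (stub for Qwen2-Audio): consume mic audio, yield text tokens.
    A real version would stream tokens out of the audio-LLM as they decode."""
    yield from ["Hello", " there", "!"]

def speak(token_stream) -> list:
    """NPU side (stub for Qwen3-TTS): take each token chunk as soon as it
    arrives. Here we collect text instead of synthesizing audio packets."""
    return [tok for tok in token_stream]

def orchestrate(audio_chunk: bytes) -> str:
    # Input -> Process -> Bridge -> Output in one streaming pass: the
    # generator hands tokens to the "TTS" without waiting for the full reply.
    return "".join(speak(listen_and_think(audio_chunk)))

print(orchestrate(b"<mic audio>"))  # → Hello there!
```

Because `listen_and_think` is a generator, the bridge step is free: the speaking side starts as soon as the first token exists, which is exactly the low-latency behavior the hybrid setup is after.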