https://civitai.com/models/827184?modelVersionId=1761560

! 2022-11-01 https://www.instagram.com
www.instagram.com##.x1uhb9sk

! Title: Hide YouTube Shorts
! Description: Hide all traces of YouTube shorts videos on YouTube
! Version: 1.8.0
! Last modified: 2023-01-08 20:02
! Expires: 2 weeks (update frequency)
! Homepage: https://github.com/gijsdev/ublock-hide-yt-shorts
! License: https://github.com/gijsdev/ublock-hide-yt-shorts/blob/master/LICENSE.md

! Hide all videos containing the phrase "#shorts"
youtube.com##ytd-grid-video-renderer:has(#video-title:has-text(#shorts))
youtube.com##ytd-grid-video-renderer:has(#video-title:has-text(#Shorts))
youtube.com##ytd-grid-video-renderer:has(#video-title:has-text(#short))
youtube.com##ytd-grid-video-renderer:has(#video-title:has-text(#Short))

! Hide all videos with the shorts indicator on the thumbnail
youtube.com##ytd-grid-video-renderer:has([overlay-style="SHORTS"])
youtube.com##ytd-rich-item-renderer:has([overlay-style="SHORTS"])
youtube.com##ytd-video-renderer:has([overlay-style="SHORTS"])
youtube.com##ytd-item-section-renderer.ytd-section-list-renderer[page-subtype="subscriptions"]:has(ytd-video-renderer:has([overlay-style="SHORTS"]))

! Hide shorts button in sidebar
youtube.com##ytd-guide-entry-renderer:has-text(Shorts)
youtube.com##ytd-mini-guide-entry-renderer:has-text(Shorts)

! Hide shorts section on homepage
youtube.com##ytd-rich-section-renderer:has(#rich-shelf-header:has-text(Shorts))
youtube.com##ytd-reel-shelf-renderer:has(.ytd-reel-shelf-renderer:has-text(Shorts))

! Hide shorts tab on channel pages
! Old style
youtube.com##tp-yt-paper-tab:has(.tp-yt-paper-tab:has-text(Shorts))
! New style (2023-10)
youtube.com##yt-tab-shape:has-text(/^Shorts$/)

! Hide shorts in video descriptions
youtube.com##ytd-reel-shelf-renderer.ytd-structured-description-content-renderer:has-text("Shorts remixing this video")
! Remove empty spaces in grid
youtube.com##ytd-rich-grid-row,#contents.ytd-rich-grid-row:style(display: contents !important)

englishtorrent.*###overlay

! 28-Sept-2024 https://nanoreview.net
nanoreview.net##.banner

||www.reddit.com/r/popular/^$document
! ||www.reddit.com/?feed=popular^$document
feed=popular*$all


Are there any end to end voice models, which contain STT, an LLM, and a TTS together?

Yes, there are end-to-end (E2E) voice models that integrate speech-to-text (STT), reasoning (LLM), and text-to-speech (TTS) into a single neural network. These are often referred to as speech-to-speech (S2S) or native multimodal models.

Leading End-to-End Voice Models

Unlike traditional "cascaded" systems that chain three separate models together, these native models process audio input directly into audio output.

- GPT-4o (OpenAI): A single model trained end-to-end across text, vision, and audio. It can respond to audio inputs in as little as 232 ms, preserving emotional nuance and prosody that are typically lost in cascaded systems.
- Qwen3-TTS/ASR (Alibaba): A native multimodal family using a "Thinker-Talker" architecture. It is optimized for low-latency, edge-first voice interaction, handling both reasoning and speech synthesis on-device.
- Ultravox: An open-weight model that replaces the separate STT/ASR stage by coupling an audio encoder directly to an LLM, letting the model "reason" on audio tokens rather than transcribed text.
- Moshi (Kyutai): An experimental E2E model capable of real-time, low-latency spoken conversation.

Comparison: End-to-End vs. Cascaded Pipelines

| Feature | End-to-End (Native S2S) | Cascaded (STT → LLM → TTS) |
| --- | --- | --- |
| Architecture | Single neural network | Multiple independent models |
| Latency | Ultra-low (approx. 200–300 ms) | Medium to high (sequential processing) |
| Prosody/Emotion | Preserved from original input | Often lost during text conversion |
| Control | Harder to debug (black box) | Highly customizable at each stage |
| Reasoning | Improving, but can be "dumber" | Superior for complex logic/tools |

Emerging Hybrid Solutions

Some platforms now offer "unified" infrastructures that co-locate separate STT, LLM, and TTS models in a single cloud region to mimic E2E performance. For example, Together AI provides a co-located stack (using Deepgram and Cartesia) that eliminates network hops and achieves sub-700 ms latency without being a single "black box" model.

Are any of these models open weight/source?

Several of the end-to-end (E2E) voice models mentioned are open weight, meaning you can download the model parameters and run them on your own hardware. However, their "openness" varies between fully open-source and research-only licenses.

Open-Weight End-to-End Models

These models combine listening (ASR/STT) and talking (TTS) in a single multimodal framework.

- Ultravox (Fixie.ai): Fully open weight and one of the most active projects. It couples an audio encoder directly to a language model (such as Llama 3.3 or GLM-4). The weights for various versions (8B to 70B+) are available on Hugging Face.
- Moshi (Kyutai): A native speech-to-speech model released with open weights. It uses a 7B "Helium" backbone and the "Mimi" codec to handle audio and text streams simultaneously. Its weights are available under a CC BY 4.0 license.
- Qwen2-Audio / Qwen3-TTS (Alibaba): Alibaba frequently releases open-weight versions of its multimodal models. Qwen2-Audio is a large-scale audio-language model that can "listen" and reason, while newer releases such as Qwen3-TTS allow expressive voice design and are available under the Apache 2.0 license.
- GLM-4-Voice (Zhipu AI): An end-to-end voice model that supports real-time, emotional human-computer interaction. It is often integrated into open-weight pipelines such as Ultravox v0.7.

Closed-Source Models (Proprietary)

- GPT-4o (OpenAI): Proprietary. You can only access its end-to-end voice capabilities through the OpenAI API or the ChatGPT app. While OpenAI has released some open-weight models (like GPT-OSS), its flagship multimodal E2E voice tech remains closed.

Summary Table: Open vs. Closed

| Model | Open Weights? | License Type | Primary Use Case |
| --- | --- | --- | --- |
| Ultravox | Yes | Apache 2.0 / MIT | Real-time enterprise voice agents |
| Moshi | Yes | CC BY 4.0 | Low-latency research & interaction |
| Qwen3-TTS | Yes | Apache 2.0 | Multilingual voice cloning & design |
| GPT-4o | No | Proprietary (API) | General-purpose high-intelligence AI |
| GLM-4-Voice | Yes | Research-friendly | Emotional, conversational reasoning |

Are there any projects that can run with 12GB vram and 32 GB system RAM?

Yes, with 12GB VRAM and 32GB system RAM you can run several capable end-to-end (speech-to-speech) models locally, provided you use quantized versions (compressed weights). Here are the specific projects and configurations that fit your hardware:

1. GLM-4-Voice-9B (Int4 Quantized)

This is likely your best bet for a high-quality, "smart" conversationalist that fits comfortably.

- Hardware fit: The INT4 quantized version is explicitly optimized to run on 12GB VRAM GPUs (like the RTX 3060/4070).
- What it does: It handles understanding, reasoning, and speech generation in one model. It can follow instructions to change its tone, speed, or emotion.
- Where to get it: Look for the GLM-4-Voice-9B-Int4 version on Hugging Face.

2. Moshi (quantized)

Moshi is a native speech-to-speech model designed for low latency, built on a 7B backbone.

- Hardware fit: The full FP16 model requires ~14–16GB VRAM, but 4-bit (Q4) or 8-bit (Q8) quantized versions fit easily into 12GB.
  - 4-bit: uses ~5–6 GB VRAM (leaving plenty of room for long context).
  - 8-bit: uses ~8–9 GB VRAM (a tighter fit, but doable).
- What it does: It listens and speaks simultaneously (full duplex), handling interruptions and emotional speech very well.
- Where to get it: the Moshi Hugging Face collection. You will likely need a backend such as llama.cpp or a specialized loader that supports quantization for multimodal models.

3. Ultravox (v0.4 / v0.5)

Ultravox is primarily an "audio-LLM" (audio in, text out) that pairs with a TTS, but newer experimental versions and pipelines are pushing toward end-to-end capabilities.

- Hardware fit: The models are typically based on Llama-3-8B or Mistral-7B.
- 8B model (4-bit): fits comfortably in 12GB VRAM.
- Note: Unlike Moshi or GLM-4-Voice, standard Ultravox implementations focus heavily on the input side (understanding audio directly) and may still rely on a lightweight separate TTS for the output in some setups, though the experience feels nearly instantaneous.

4. Qwen2-Audio-7B (Audio-In Only) + Qwen3-TTS

While not a single "fused" model, this pair is optimized to run together on consumer hardware.

- Configuration:
  - Qwen2-Audio-7B (Int4): ~5GB VRAM.
  - Qwen3-TTS: ~2GB VRAM.
  - Total: ~7GB VRAM (fits easily within your 12GB limit).
- Why use this: It separates "thinking/listening" from "speaking," which gives you higher-quality voice control (using Qwen3-TTS's voice cloning) compared to the occasionally "robotic" artifacts of some fully fused models.

Summary Recommendation

- For the "real" end-to-end experience: download GLM-4-Voice-9B (Int4). It is the most cohesive "single brain" experience that fits your card.
- For low-latency chat: try Moshi (quantized). It feels the most "alive" due to its duplex nature.

In their repo: "While we hope that the present codebase will work on Windows, we do not provide official support for it. We have tested the MLX version on a MacBook Pro M3. At the moment, we do not support quantization for the PyTorch version, so you will need a GPU with a significant amount of memory (24GB)." But kyutai/moshiko-pytorch-q8 does exist. What does this mean?

It means there is a gap between the project's official documentation and what the community (or the devs themselves) has actually released since the initial launch. Here is why that contradiction exists and what it means for you:

1. The README is likely outdated

When Moshi first launched, only "full precision" (BF16) was supported for the PyTorch version, which requires ~24GB VRAM. The README was written at that time. Since then, the model has also been quantized into the Q8 (8-bit) format you found.

2. "Official support" vs. "it exists"

The disclaimer about Windows and 24GB VRAM is a "cover our backs" statement.

- Official support: If you hit an issue on Windows or with a quantized model, the developers might not help you troubleshoot it.
- The reality: The moshiko-pytorch-q8 model was created specifically to lower the VRAM requirement. 8-bit quantization typically cuts VRAM usage roughly in half compared to BF16.

3. Will it work for you?

Yes, but with caveats:

- VRAM: The Q8 version of Moshi (7B parameters) should use roughly 8GB to 10GB of VRAM. That fits into your 12GB card, but it's tight: you won't have much room for other GPU-heavy apps (a web browser with hardware acceleration, a game) running in the background.
- Windows: Since Windows isn't officially supported, you might run into issues with specific dependencies (like uvloop or certain audio drivers). Pro tip: use WSL2 (Windows Subsystem for Linux). It lets you run the Linux version of the code on your Windows machine while still using your GPU, and it is much more stable for these kinds of AI projects.

The Bottom Line

Ignore the "24GB" warning in the README if you are using the Q8 version; that warning applies to the uncompressed model. With your 12GB VRAM, the Q8 version is exactly what you should be trying.
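The VRAM figures above follow from a simple rule of thumb: weight memory ≈ parameter count × bytes per parameter, plus headroom for the KV cache and activations. A minimal sketch of that arithmetic; the 20% overhead factor is an assumption for illustration, not a measured number:

```python
def estimate_vram_gb(params_billions: float, bits_per_param: int,
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weight size plus ~20% for cache/activations."""
    weight_gb = params_billions * bits_per_param / 8  # 1B params at 8 bits = 1 GB
    return round(weight_gb * overhead, 1)

# Moshi's 7B backbone at different precisions:
print(estimate_vram_gb(7, 16))  # → 16.8  (BF16, needs a 24GB-class card)
print(estimate_vram_gb(7, 8))   # → 8.4   (Q8, fits a 12GB card)
print(estimate_vram_gb(7, 4))   # → 4.2   (Q4, plenty of headroom)
```

The estimates land inside the ranges quoted above (~8–10GB for Q8, ~5GB for Q4); real usage varies with context length and backend.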
If Q8 still feels sluggish or hits your VRAM limit, look for a Q4 version (which would only use ~5GB VRAM), though audio quality/reasoning might drop slightly.

Could this Qwen3-TTS run on the NPU in Ultra 9 275HX?

Yes, Qwen3-TTS can run on the NPU of the Intel Core Ultra 9 275HX using the OpenVINO toolkit. Intel provides "Day 0" support for the Qwen3 family, including specific optimizations for the NPU and integrated Xe GPU to handle high-speed audio generation.

Running Qwen3-TTS on Your Hardware

The Core Ultra 9 275HX (Arrow Lake) features an Intel AI Boost NPU capable of roughly 13 TOPS. For Qwen3-TTS specifically:

- Model compatibility: Both the 0.6B and 1.7B parameter versions of Qwen3-TTS are designed for "edge devices" like yours.
- Performance: Using the NPU with OpenVINO allows ultra-low-latency streaming, with first-audio-packet delivery as fast as 97ms.
- Resource efficiency: With the NPU handling inference, your dedicated GPU (e.g., the RTX 50-series often paired with this CPU) stays free for other tasks like gaming or rendering.

How to Deploy

To run it on the NPU, you typically use the OpenVINO Notebooks implementation:

1. Convert: Export the model components (Talker, Embedding, Code Predictor) to OpenVINO's .xml IR format.
2. Select device: In your inference script or the OpenVINO pipeline, set the target device to NPU.
3. Quantize: While 8-bit quantization is standard for Qwen3, OpenVINO also supports INT4 or INT8 optimizations to further reduce the load on the NPU.

If you prefer a simpler setup, tools like Pinokio can automate the installation, though they often default to the GPU for faster initial setup.
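The "select device" step can be sketched with OpenVINO's Python API. `pick_device` is a small helper written here (not an OpenVINO function), and the IR file name is a placeholder for whatever the conversion step actually produced:

```python
def pick_device(available, preferred=("NPU", "GPU", "CPU")):
    """Return the first preferred OpenVINO inference device that is present."""
    for dev in preferred:
        if dev in available:
            return dev
    raise RuntimeError("no OpenVINO inference device found")

# With OpenVINO installed (pip install openvino) this becomes:
#   import openvino as ov
#   core = ov.Core()
#   device = pick_device(core.available_devices)  # "NPU" on a Core Ultra with drivers loaded
#   compiled = core.compile_model("qwen3_tts_talker.xml", device)  # placeholder IR path
```

Falling back through GPU to CPU keeps the script usable on machines where the NPU driver is missing.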
While that's running there can I run Qwen2-Audio-7B (Int4) on the 5070 Ti? Or is it better to run both on the same GPU?

Yes, you can run Qwen2-Audio-7B (Int4) on your RTX 5070 Ti while Qwen3-TTS runs on your NPU. In fact, separating them is the better strategy for performance.

Why Split the Workload?

- No resource contention: If you put both on the GPU, they would compete for VRAM and compute cycles. Your RTX 5070 Ti has 16GB VRAM, which is plenty, but running them simultaneously on one device could cause "hiccups" in the audio stream if the GPU gets busy processing a complex user prompt.
- Parallel processing: The NPU is essentially "free real estate" for the TTS. It can generate speech while the GPU is already starting on the next part of the conversation or processing the next audio input.
- Lower latency: By offloading the TTS to the NPU (which is optimized for streaming small packets), you avoid the overhead of moving data back and forth to the GPU for every tiny audio chunk.

Proposed Configuration

This "hybrid" setup plays to each piece of hardware's strengths:

| Component | Model | Device | VRAM/RAM Usage | Role |
| --- | --- | --- | --- | --- |
| Listening & thinking | Qwen2-Audio-7B (Int4) | RTX 5070 Ti (GPU) | ~5–6 GB VRAM | Listens to your voice, understands context, and generates the text response. |
| Speaking | Qwen3-TTS | Intel NPU | Shared system RAM | Converts the text response from the GPU into audio instantly. |

Feasibility Check

- GPU headroom: The RTX 5070 Ti has 16GB VRAM, and the Qwen2-Audio-7B (Int4) model only needs about 5–6GB, leaving ~10GB of VRAM for other tasks (gaming, browser, OS) without any slowdowns.
- NPU capability: The Core Ultra 9's NPU is specifically validated by Intel to run Qwen3-TTS efficiently.

How to Set This Up (Conceptual)

You will likely need a small "orchestrator" script (Python) to glue them together:

1. Input: Microphone audio goes to the GPU (Qwen2-Audio).
2. Process: The GPU generates text tokens (e.g., "Hello there!").
3. Bridge: As soon as text tokens appear, stream them to the NPU.
4. Output: The NPU (via OpenVINO) synthesizes the audio and plays it out.

This setup essentially mimics the architecture of a professional cloud voice agent, but runs entirely offline on your desk.
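The four steps above can be sketched as a minimal orchestrator skeleton. Both model calls are stubs: a real version would run Qwen2-Audio on the GPU inside `listen_and_think` and drive Qwen3-TTS via OpenVINO inside `speak`:

```python
def listen_and_think(audio_chunk: bytes):
    """GPU side (stub for Qwen2-Audio): consume mic audio, yield text tokens.
    A real version would stream tokens out of the audio-LLM as they decode."""
    yield from ["Hello", " there", "!"]

def speak(token_stream) -> list:
    """NPU side (stub for Qwen3-TTS): take each token chunk as soon as it
    arrives. Here we collect text instead of synthesizing audio packets."""
    return [tok for tok in token_stream]

def orchestrate(audio_chunk: bytes) -> str:
    # Input -> Process -> Bridge -> Output in one streaming pass: the
    # generator hands tokens to the "TTS" without waiting for the full reply.
    return "".join(speak(listen_and_think(audio_chunk)))

print(orchestrate(b"<mic audio>"))  # → Hello there!
```

Because `listen_and_think` is a generator, the bridge step is free: the speaking side starts as soon as the first token exists, which is exactly the low-latency behavior the hybrid setup is after.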