Updated my web-ai-toolkit library yesterday to use the new v3 version of transformers.js from Hugging Face! This turns on GPU support for speech-to-text and summarization, using the native support in transformers.js! https://lnkd.in/gA76zdXP
-
I am working on a hybrid Graph + Vector RAG system, very much like the one described here: https://lnkd.in/ehzkT9mK (I started it before I read the paper, but the paper, from BlackRock and NVIDIA, was interesting and helpful in some ways). If anyone is working on these kinds of systems and wants to talk tech and compare notes, without any use-case details or information, let me know. I'd love to tango.
-
While I wasn't looking, Ollama gained support for Llama-3.2-Vision; 4-bit quants are available for the 11B and 90B models. Inference with the 11B model takes 30s on an M3 Pro Mac, 40s on an Nvidia P40, 15-20s on an Nvidia A10G, and 16-25s on an Nvidia L4. On Bedrock, Claude-3.5-Sonnet takes 8s and Claude-3-Haiku 5s. Cost per image: A10G (g5.xlarge with a 3-year reservation) 0.18¢, Claude-3.5 Sonnet 0.48¢, Claude-3.5 Haiku 0.16¢, Claude-3 Haiku 0.04¢. The cheapest Haiku model is broadly comparable to Llama-3.2-Vision, significantly cheaper, and a managed service.
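For anyone who wants to reproduce the Llama-3.2-Vision timings, here is a minimal sketch of hitting a local Ollama server with an image over its REST API. The model tag, port, and image path are assumptions; check `ollama list` and the Ollama API docs for your setup.

```python
import base64
import requests

# Hypothetical local image; Ollama's generate endpoint accepts base64-encoded images.
with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:11434/api/generate",  # default Ollama port (assumed)
    json={
        "model": "llama3.2-vision",          # assumed tag for the 11B 4-bit quant
        "prompt": "Describe this image in one sentence.",
        "images": [image_b64],
        "stream": False,
    },
)
resp.raise_for_status()
print(resp.json()["response"])
```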
-
Text-Embeddings-Inference is an open-source framework for embedding and reranker inference. It is made by Hugging Face and focuses on low-latency, high-throughput serving, with short boot times thanks to a small Docker image.
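A minimal sketch of querying a running TEI container from Python; the Docker image tag, model, and port below are assumptions, so check the Text-Embeddings-Inference README for current values.

```python
import requests

# Assumes a TEI container is already running, e.g. something along the lines of:
#   docker run -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cpu-latest \
#       --model-id BAAI/bge-small-en-v1.5
# (image tag and model are assumptions, not copied from the TEI docs)
resp = requests.post(
    "http://localhost:8080/embed",
    json={"inputs": ["What is Text-Embeddings-Inference?", "TEI serves embedding models."]},
)
resp.raise_for_status()
embeddings = resp.json()  # one float vector per input string
print(len(embeddings), len(embeddings[0]))
```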
-
A week ago, George Aristides and I developed a small Competitive Model Comparison (CMC) 🚰 pipeline 🚰 for a variety of scikit-learn machine learning models, streamlining the process significantly. The pipeline reduces the lines of code required to obtain basic model scores from several hundred to just about 3, and it also incorporates hyperparameter tuning with k-fold CV and visualizations, such as tree plots for Decision Trees. We plan to enhance the pipeline further by adding features like automatic dataset balancing and faster SVM training. In the meantime, this tool has enabled us to efficiently work on multiple large datasets, conduct exploratory data analysis, and train 8 classic ML models in just minutes! On a fun note, our continuous pipeline work led George to humorously name this creation 'plumber'... 😆 Check out our collaborative project here: https://lnkd.in/gE-UzWX6 #machinelearning #ml #ai #training #data_analysis #datasets
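This is not the 'plumber' API itself (that lives in the linked repo), just a hedged sketch of the underlying idea: scoring several scikit-learn models with k-fold cross-validation in a handful of lines.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in dataset for illustration
models = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
    "svm": SVC(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold CV accuracy
    print(f"{name}: {scores.mean():.3f} ± {scores.std():.3f}")
```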
-
#LLMSys For LLM serving, a homogeneous GPU setup may not be cost-effective. The paper "Efficient and Economic Large Language Model Inference with Attention Offloading" (https://lnkd.in/ed3aRDu2) shows that combining two different GPUs and separating the attention and linear computations (since they have different memory/compute requirements) actually achieves higher throughput per dollar. (I also wondered about serving a language model at home by combining a 3090 with a much cheaper P40 😺)
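Not the paper's system, just a toy PyTorch sketch of the idea: put the memory-bound attention on one device and the compute-bound linear layers on another, moving activations between them. Device names and sizes are assumptions, and it only runs with two visible CUDA devices.

```python
import torch
import torch.nn as nn

attn_dev = torch.device("cuda:0")  # e.g. the cheaper, memory-rich card
mlp_dev = torch.device("cuda:1")   # e.g. the faster, compute-rich card

d_model, n_heads = 512, 8
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True).to(attn_dev)
mlp = nn.Sequential(
    nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
).to(mlp_dev)

x = torch.randn(1, 128, d_model, device=attn_dev)  # (batch, seq, hidden)
attn_out, _ = attn(x, x, x)          # attention stays on the memory-bound device
mlp_out = mlp(attn_out.to(mlp_dev))  # linear layers run on the compute-bound device
print(mlp_out.shape)
```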
-
Implementation of FastSAM by Ultralytics using LiteRT (formerly TensorFlow Lite). Inference time is as low as 300 ms per camera frame when running on the GPU, with some additional time required for post-processing. For more details on the project, including links to the GitHub repository, check out this post: https://lnkd.in/ddUBRqXF #AISprint #litert #googlefordevelopers
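For context, this is roughly what driving a .tflite model from Python looks like with the classic interpreter API; the model path and input handling here are assumptions, and the real FastSAM export, GPU delegate setup, and post-processing live in the linked project.

```python
import numpy as np
import tensorflow as tf

# "fastsam.tflite" is a hypothetical export path; this sketch runs on CPU,
# whereas the ~300 ms figure above relies on the on-device GPU delegate.
interpreter = tf.lite.Interpreter(model_path="fastsam.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Fake a camera frame with whatever shape the model expects, e.g. (1, H, W, 3).
frame = np.random.rand(*input_details[0]["shape"]).astype(np.float32)

interpreter.set_tensor(input_details[0]["index"], frame)
interpreter.invoke()

raw_masks = interpreter.get_tensor(output_details[0]["index"])
print(raw_masks.shape)  # still needs FastSAM's post-processing step
```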
-
Pixtral 12B in short:
- Natively multimodal, trained with interleaved image and text data
- Strong performance on multimodal tasks, excels in instruction following
- Maintains state-of-the-art performance on text-only benchmarks

Architecture:
- New 400M parameter vision encoder trained from scratch
- 12B parameter multimodal decoder based on Mistral Nemo
- Supports variable image sizes and aspect ratios
- Supports multiple images in a long context window of 128k tokens

Use:
- License: Apache 2.0
- Try it on La Plateforme or on Le Chat
-
Understanding Flash Attention: Writing the Algorithm from Scratch in Triton. Find out how Flash Attention works; afterward, we'll refine our understanding by writing a GPU kernel for the algorithm in Triton. Continue reading on Towards Data Science: https://lnkd.in/dBMeQJiP #AI #ML #Automation
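Before reaching for Triton, the core trick is easy to sketch in plain NumPy: an online softmax that processes K/V in blocks, so softmax(QK^T / sqrt(d)) · V is computed without materializing the full attention matrix. This is a hedged sketch of the math the fused kernel implements, not the article's kernel; shapes and block size are made up.

```python
import numpy as np

def flash_like_attention(q, k, v, block=64):
    """Blockwise attention with an online softmax (no full score matrix)."""
    seq, d = k.shape
    out = np.zeros_like(q)                 # running weighted sum of values
    m = np.full(q.shape[0], -np.inf)       # running row-wise max of scores
    l = np.zeros(q.shape[0])               # running softmax normalizer
    for start in range(0, seq, block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = q @ kb.T / np.sqrt(d)                      # scores for this block
        m_new = np.maximum(m, s.max(axis=1))
        scale = np.exp(m - m_new)                      # rescale earlier partials
        p = np.exp(s - m_new[:, None])
        l = l * scale + p.sum(axis=1)
        out = out * scale[:, None] + p @ vb
        m = m_new
    return out / l[:, None]

rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((128, 64)), rng.standard_normal((256, 64)), rng.standard_normal((256, 64))
scores = q @ k.T / np.sqrt(64)
ref = np.exp(scores - scores.max(axis=1, keepdims=True))
ref = (ref / ref.sum(axis=1, keepdims=True)) @ v
print(np.allclose(flash_like_attention(q, k, v), ref))  # True
```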
-
Optimizing Parallel Sorting with CUDA: Bitonic Sort in Action! 🚀 Just played around with a Bitonic Sort algorithm using CUDA. Sometimes you gotta take a break from the usual and get back to the nitty-gritty. And as a tech founder who loves C++ & CUDA, this was right up my alley. We're talking high-speed parallel sorting for HUGE datasets! 🔥

So, what is Bitonic Sort? 🤔 It's a parallel sorting algorithm designed to efficiently sort elements on GPUs. CUDA lets us harness the full power of GPU cores, making it perfect for large-scale computations where speed is critical. In this implementation, we use CUDA kernels to process multiple elements simultaneously, enabling fast sorting by leveraging warp-level parallelism and shared memory.

What's the use case? 🤔 From high-frequency trading to scientific simulations, wherever large data arrays need sorting quickly, Bitonic Sort can significantly improve performance. Whether handling real-time data streams or crunching large-scale computational tasks, it scales efficiently with hardware. The same code has applications in machine learning, where sorting is crucial for tasks like K-Nearest Neighbors, Decision Trees, and Gradient Boosting. In deep learning, sorting aids quantization and data preprocessing. Computer vision benefits too, using sorting in image processing and object recognition, and NLP models often need to process large text corpora efficiently, where bitonic sort can be used to sort words, documents, or other NLP-related data.

By incorporating hierarchical merges and shared-memory optimization, this code pushes GPU performance to the next level. ⚡️ It's all about reducing computational overhead and maximizing throughput in real-world scenarios. AI research is all about handling giant piles of data, so the code's a bit complex, but it's worth it for the speed boost. In AI, faster is always better, trying new things is key, and open-source tools are like gold for learning and sharing ideas. Bringing these algorithms into production means faster systems and smarter resource usage, so you can have your data crunching and eat it too! 🍕

And guess what? You don't need to be in Silicon Valley to code with CUDA for AI and beyond. Just bring your brainpower. 😎🚀 Feeling the power of parallel processing and still a coder at heart! 💻👾

#CUDA #ParallelComputing #GPUs #DataScience #HighPerformanceComputing #BitonicSort #TechInnovation #AlgorithmOptimization #GPUTech #DeepLearning #DataEngineering #MachineLearning #ComputationalPerformance #BigData #TechTrends #AIAlgorithms #SortingAlgorithms #RealTimeProcessing #CodeOptimization #PerformanceEngineering #ScalableComputing #TechRevolution #InnovationInTech #FastDataProcessing #HighSpeedComputing #HPC #AIresearch
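Not the CUDA code from the post, just a plain-Python sketch of the bitonic compare-and-swap schedule that the GPU kernels run in parallel; it assumes the input length is a power of two.

```python
import random

def bitonic_sort(a):
    """In-place bitonic sort for lists whose length is a power of two."""
    n = len(a)
    k = 2
    while k <= n:                       # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:                   # compare-exchange distance within the stage
            for i in range(n):          # on a GPU, this loop is one thread per i
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a

data = [random.randint(0, 999) for _ in range(16)]
expected = sorted(data)
print(bitonic_sort(data) == expected)  # True
```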
-
Amazing how fast AI can progress with open models & weights. Here are 2 guys who managed to quantize DeepSeek's R1 671B-parameter model down to 131GB. You could run this model on a PC with a good amount of RAM and/or a couple of 3090 GPUs with 24GB of VRAM each. https://lnkd.in/gEqgdCVn