Updated my web-ai-toolkit library yesterday to use the new v3 version of transformers.js from Hugging Face! This turns on GPU support for speech-to-text and summarization, using the native support in transformers.js! https://lnkd.in/gA76zdXP
-
I am working on a hybrid Graph + Vector RAG system, very much like the one described here: https://lnkd.in/ehzkT9mK (I started it before I read the paper, but the paper, from BlackRock and NVIDIA, was interesting and helpful in some ways). If anyone is working on these kinds of systems and wants to talk tech and compare notes, without any use-case details or information, let me know. I'd love to tango.
-
While I wasn't looking, Ollama gained support for Llama-3.2-Vision; 4-bit quants are available for the 11B and 90B models. Inference with the 11B model takes 30s on an M3 Pro Mac, 40s on an Nvidia P40, 15-20s on an Nvidia A10G, and 16-25s on an Nvidia L4. On Bedrock, Claude-3.5-Sonnet takes 8s and Claude-3-Haiku 5s. Cost per image: A10G (g5.xlarge with a 3-year reservation) 0.18¢, Claude-3.5 Sonnet 0.48¢, Claude-3.5 Haiku 0.16¢, Claude-3 Haiku 0.04¢. The cheapest Haiku model is broadly comparable to Llama-3.2-Vision, significantly cheaper, and a managed service.
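For anyone who wants to reproduce the Llama-3.2-Vision timings, here is a minimal sketch of hitting a local Ollama server with an image over its REST API. The model tag, port, and image path are assumptions; check `ollama list` and the Ollama API docs for your setup.

```python
import base64
import requests

# Hypothetical local image; Ollama's generate endpoint accepts base64-encoded images.
with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:11434/api/generate",  # default Ollama port (assumed)
    json={
        "model": "llama3.2-vision",          # assumed tag for the 11B 4-bit quant
        "prompt": "Describe this image in one sentence.",
        "images": [image_b64],
        "stream": False,
    },
)
resp.raise_for_status()
print(resp.json()["response"])
```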
-
Text-Embeddings-Inference is an open-source framework for embedding and reranker inference. It is made by Hugging Face and focuses on low-latency, high-throughput serving, with short boot times thanks to a small Docker image.
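A minimal sketch of querying a running TEI container from Python; the Docker image tag, model, and port below are assumptions, so check the Text-Embeddings-Inference README for current values.

```python
import requests

# Assumes a TEI container is already running, e.g. something along the lines of:
#   docker run -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cpu-latest \
#       --model-id BAAI/bge-small-en-v1.5
# (image tag and model are assumptions, not copied from the TEI docs)
resp = requests.post(
    "http://localhost:8080/embed",
    json={"inputs": ["What is Text-Embeddings-Inference?", "TEI serves embedding models."]},
)
resp.raise_for_status()
embeddings = resp.json()  # one float vector per input string
print(len(embeddings), len(embeddings[0]))
```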
-
A week ago, George Aristides and I developed a small Competitive Model Comparison (CMC) 🚰 pipeline 🚰 for a variety of scikit-learn machine learning models, streamlining the process significantly. The pipeline reduces the lines of code required to obtain basic model scores from several hundred to just about 3, and it also incorporates hyperparameter tuning with k-fold CV and visualizations, such as tree plots for Decision Trees. We plan to enhance the pipeline further by adding features like automatic dataset balancing and faster SVM training. In the meantime, this tool has enabled us to efficiently work on multiple large datasets, conduct exploratory data analysis, and train 8 classic ML models in just minutes! On a fun note, our continuous pipeline work led George to humorously name this creation 'plumber'... 😆 Check out our collaborative project here: https://lnkd.in/gE-UzWX6 #machinelearning #ml #ai #training #data_analysis #datasets
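This is not the 'plumber' API itself (that lives in the linked repo), just a hedged sketch of the underlying idea: scoring several scikit-learn models with k-fold cross-validation in a handful of lines.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in dataset for illustration
models = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
    "svm": SVC(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold CV accuracy
    print(f"{name}: {scores.mean():.3f} ± {scores.std():.3f}")
```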
-
#LLMSys For LLM serving, a homogeneous GPU setup may not be cost-effective. The paper "Efficient and Economic Large Language Model Inference with Attention Offloading" (https://lnkd.in/ed3aRDu2) shows that combining two different GPUs and separating the attention and linear computations (since they have different memory/compute requirements) actually achieves higher throughput per dollar. (I also wondered about serving a language model at home by combining a 3090 with a much cheaper P40 😺)
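Not the paper's system, just a toy PyTorch sketch of the idea: put the memory-bound attention on one device and the compute-bound linear layers on another, moving activations between them. Device names and sizes are assumptions, and it only runs with two visible CUDA devices.

```python
import torch
import torch.nn as nn

attn_dev = torch.device("cuda:0")  # e.g. the cheaper, memory-rich card
mlp_dev = torch.device("cuda:1")   # e.g. the faster, compute-rich card

d_model, n_heads = 512, 8
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True).to(attn_dev)
mlp = nn.Sequential(
    nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
).to(mlp_dev)

x = torch.randn(1, 128, d_model, device=attn_dev)  # (batch, seq, hidden)
attn_out, _ = attn(x, x, x)          # attention stays on the memory-bound device
mlp_out = mlp(attn_out.to(mlp_dev))  # linear layers run on the compute-bound device
print(mlp_out.shape)
```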
-
Implementation of FastSAM by Ultralytics using LiteRT (formerly TensorFlow Lite). Inference time is as low as 300 ms per camera frame when running on the GPU, with some additional time required for post-processing. For more details on the project, including links to the GitHub repository, check out this post: https://lnkd.in/ddUBRqXF #AISprint #litert #googlefordevelopers
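For context, this is roughly what driving a .tflite model from Python looks like with the classic interpreter API; the model path and input handling here are assumptions, and the real FastSAM export, GPU delegate setup, and post-processing live in the linked project.

```python
import numpy as np
import tensorflow as tf

# "fastsam.tflite" is a hypothetical export path; this sketch runs on CPU,
# whereas the ~300 ms figure above relies on the on-device GPU delegate.
interpreter = tf.lite.Interpreter(model_path="fastsam.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Fake a camera frame with whatever shape the model expects, e.g. (1, H, W, 3).
frame = np.random.rand(*input_details[0]["shape"]).astype(np.float32)

interpreter.set_tensor(input_details[0]["index"], frame)
interpreter.invoke()

raw_masks = interpreter.get_tensor(output_details[0]["index"])
print(raw_masks.shape)  # still needs FastSAM's post-processing step
```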
-
Pixtral 12B in short:
- Natively multimodal, trained with interleaved image and text data
- Strong performance on multimodal tasks, excels in instruction following
- Maintains state-of-the-art performance on text-only benchmarks

Architecture:
- New 400M parameter vision encoder trained from scratch
- 12B parameter multimodal decoder based on Mistral Nemo
- Supports variable image sizes and aspect ratios
- Supports multiple images in a long context window of 128k tokens

Use:
- License: Apache 2.0
- Try it on La Plateforme or on Le Chat
-
Understanding Flash Attention: Writing the Algorithm from Scratch in Triton. Find out how Flash Attention works; afterward, we'll refine our understanding by writing a GPU kernel for the algorithm in Triton. Continue reading on Towards Data Science: https://lnkd.in/dBMeQJiP #AI #ML #Automation
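Before reaching for Triton, the core trick is easy to sketch in plain NumPy: an online softmax that processes K/V in blocks, so softmax(QK^T / sqrt(d)) · V is computed without materializing the full attention matrix. This is a hedged sketch of the math the fused kernel implements, not the article's kernel; shapes and block size are made up.

```python
import numpy as np

def flash_like_attention(q, k, v, block=64):
    """Blockwise attention with an online softmax (no full score matrix)."""
    seq, d = k.shape
    out = np.zeros_like(q)                 # running weighted sum of values
    m = np.full(q.shape[0], -np.inf)       # running row-wise max of scores
    l = np.zeros(q.shape[0])               # running softmax normalizer
    for start in range(0, seq, block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = q @ kb.T / np.sqrt(d)                      # scores for this block
        m_new = np.maximum(m, s.max(axis=1))
        scale = np.exp(m - m_new)                      # rescale earlier partials
        p = np.exp(s - m_new[:, None])
        l = l * scale + p.sum(axis=1)
        out = out * scale[:, None] + p @ vb
        m = m_new
    return out / l[:, None]

rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((128, 64)), rng.standard_normal((256, 64)), rng.standard_normal((256, 64))
scores = q @ k.T / np.sqrt(64)
ref = np.exp(scores - scores.max(axis=1, keepdims=True))
ref = (ref / ref.sum(axis=1, keepdims=True)) @ v
print(np.allclose(flash_like_attention(q, k, v), ref))  # True
```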
-
Optimizing Parallel Sorting with CUDA: Bitonic Sort in Action! 🚀 Just played around with a Bitonic Sort algorithm using CUDA. Sometimes you gotta take a break from the usual and get back to the nitty-gritty. And as a tech founder who loves C++ & CUDA, this was right up my alley. We're talking high-speed parallel sorting for HUGE datasets! 🔥

So, what is Bitonic Sort? 🤔 It's a parallel sorting algorithm designed to efficiently sort elements on GPUs. CUDA lets us harness the full power of GPU cores, making it perfect for large-scale computations where speed is critical. In this implementation, we use CUDA kernels to process multiple elements simultaneously, enabling fast sorting by leveraging warp-level parallelism and shared memory.

What's the use case? 🤔 From high-frequency trading to scientific simulations, wherever large data arrays need sorting quickly, Bitonic Sort can significantly improve performance. Whether handling real-time data streams or crunching large-scale computational tasks, it scales efficiently with hardware. The same code has applications in machine learning, where sorting is crucial for tasks like K-Nearest Neighbors, Decision Trees, and Gradient Boosting. In deep learning, sorting aids quantization and data preprocessing. Computer vision benefits too, using sorting in image processing and object recognition, and NLP models often need to process large text corpora efficiently, where bitonic sort can be used to sort words, documents, or other NLP-related data.

By incorporating hierarchical merges and shared-memory optimization, this code pushes GPU performance to the next level. ⚡️ It's all about reducing computational overhead and maximizing throughput in real-world scenarios. AI research is all about handling giant piles of data, so the code's a bit complex, but it's worth it for the speed boost. In AI, faster is always better, trying new things is key, and open-source tools are like gold for learning and sharing ideas. Bringing these algorithms into production means faster systems and smarter resource usage, so you can have your data crunching and eat it too! 🍕

And guess what? You don't need to be in Silicon Valley to code with CUDA for AI and beyond. Just bring your brainpower. 😎🚀 Feeling the power of parallel processing and still a coder at heart! 💻👾

#CUDA #ParallelComputing #GPUs #DataScience #HighPerformanceComputing #BitonicSort #TechInnovation #AlgorithmOptimization #GPUTech #DeepLearning #DataEngineering #MachineLearning #ComputationalPerformance #BigData #TechTrends #AIAlgorithms #SortingAlgorithms #RealTimeProcessing #CodeOptimization #PerformanceEngineering #ScalableComputing #TechRevolution #InnovationInTech #FastDataProcessing #HighSpeedComputing #HPC #AIresearch
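Not the CUDA code from the post, just a plain-Python sketch of the bitonic compare-and-swap schedule that the GPU kernels run in parallel; it assumes the input length is a power of two.

```python
import random

def bitonic_sort(a):
    """In-place bitonic sort for lists whose length is a power of two."""
    n = len(a)
    k = 2
    while k <= n:                       # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:                   # compare-exchange distance within the stage
            for i in range(n):          # on a GPU, this loop is one thread per i
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a

data = [random.randint(0, 999) for _ in range(16)]
expected = sorted(data)
print(bitonic_sort(data) == expected)  # True
```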
-
Amazing how fast AI can progress with open models & weights. Here are 2 guys who managed to quantize DeepSeek's R1 671B-parameter model down to 131GB. You could run this model on a PC with a good amount of RAM and/or a couple of 3090 GPUs with 24GB of VRAM each. https://lnkd.in/gEqgdCVn