Ajay S.’s Post


Small language models are very powerful at doing specific tasks. In machine learning, particularly for large language models (LLMs), improving performance while reducing training costs is crucial. One effective way to do this is to improve the quality of the pretraining data through a process called data pruning: selecting the best parts of a large dataset for training and removing noisy, irrelevant samples. This streamlines training and boosts the model's performance.

A common problem with training LLMs is that the data can be vast and messy. Poor-quality data makes models perform poorly, so it's important to filter out the bad data and keep only the good. Traditional approaches include basic filtering rules and simple classifiers, but these often fall short on large, diverse datasets.

Researchers from Databricks, MIT, and DatologyAI have developed a more advanced method. They use small reference models to measure perplexity, which captures how well a model can predict a piece of text; lower perplexity indicates higher-quality data. Here's how it works (a code sketch follows below):

1. Train a small model: a small reference model is trained on a random subset of the data.
2. Compute perplexity: this small model then scores the perplexity of each sample in the larger dataset.
3. Prune based on perplexity: the samples with the lowest perplexity scores (indicating high quality) are kept.
4. Train the larger model: the large model is then trained on this high-quality, pruned dataset.

This method has been shown to improve the performance of large models significantly. For example, using perplexity scores from a 125-million-parameter model to prune data improved the downstream performance of a much larger 3-billion-parameter model by up to 2.04%, and reduced the pretraining steps needed to reach that performance level by up to 1.45x.

Perplexity-based pruning is effective across a variety of scenarios and datasets, demonstrating its robustness. It improves model performance while reducing the computational resources required, making it a valuable tool for data researchers. In essence, by using smaller models to filter out bad data, researchers can train bigger models more efficiently and effectively.

Link to paper: https://lnkd.in/dE6KkrTn

#llm #ai #generativeai
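To make the recipe concrete, here is a minimal Python sketch of the pruning step described above, using Hugging Face Transformers. It assumes a small pretrained checkpoint (EleutherAI's pythia-160m is a stand-in for the paper's ~125M reference model, which the authors train themselves) and keeps the lowest-perplexity fraction of a toy corpus. This is an illustration of the idea, not the authors' implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in reference model; the paper trains its own ~125M model on a random
# subset of the corpus -- here we simply load a small pretrained checkpoint.
REFERENCE_MODEL = "EleutherAI/pythia-160m"

tokenizer = AutoTokenizer.from_pretrained(REFERENCE_MODEL)
model = AutoModelForCausalLM.from_pretrained(REFERENCE_MODEL)
model.eval()


@torch.no_grad()
def perplexity(text: str) -> float:
    """Perplexity of `text` under the small reference model (lower = 'cleaner')."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    # Passing labels equal to input_ids makes the model return the mean
    # cross-entropy loss; exponentiating it gives perplexity.
    loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()


def prune_by_perplexity(corpus: list[str], keep_fraction: float = 0.5) -> list[str]:
    """Keep the `keep_fraction` of samples with the lowest perplexity scores."""
    ranked = sorted(corpus, key=perplexity)
    return ranked[: int(len(corpus) * keep_fraction)]


if __name__ == "__main__":
    toy_corpus = [
        "The mitochondria is the powerhouse of the cell.",
        "asdf qwer zxcv 1234 !!!! buy now click here",
    ]
    # The pruned subset would then be used to pretrain the larger model.
    print(prune_by_perplexity(toy_corpus, keep_fraction=0.5))
```

In practice the scoring pass would be batched and streamed over billions of documents, and the keep fraction is a hyperparameter worth tuning for a given corpus.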

arXiv:2405.20541 (arxiv.org)
