Synthetic data: save money, time and carbon with open source
Published February 16, 2024, by Moritz Laurer (MoritzLaurer)

tl;dr

Should you fine-tune your own model or use an LLM API? Creating your own model puts you in full control but requires expertise in data collection, training, and deployment. LLM APIs are much easier to use but force you to send your data to a third party and create costly dependencies on LLM providers. This blog post shows how you can combine the convenience of LLMs with the control and efficiency of customized models. In a case study on identifying investor sentiment in the news, we show how to use an open-source LLM to create synthetic data to train your customized model in a few steps. Our resulting custom RoBERTa model:

- can analyze a large news corpus for around $2.70, compared to $3,061 with GPT-4;
- emits around 0.12 kg CO2, compared to very roughly 735 to 1,100 kg CO2 with GPT-4;
- responds with a latency of 0.13 seconds, compared to often multiple seconds with GPT-4;
- performs on par with GPT-4 at identifying investor sentiment (both 94% accuracy and 0.94 F1 macro).

We provide reusable notebooks, which you can apply to your own use cases.

Table of Contents
1. The problem: There is no data for your use-case
2. The solution: Synthetic data to teach efficient students
3. Case study: Monitoring financial sentiment
   3.1 Prompt an LLM to annotate your data
   3.2 Compare the open-source model to proprietary models
   3.3 ...
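The core of the approach above is asking an LLM to label each text and then mapping its free-text answer onto a fixed label set before fine-tuning a small model on the result. The sketch below illustrates that annotation step; `build_prompt` and `parse_label` are illustrative helpers, not the exact code from the linked notebooks.

```python
# Minimal sketch of the synthetic-annotation step (illustrative, not the
# exact notebook code): build an instruction prompt per headline and map
# the LLM's free-text answer onto a fixed label set.

LABELS = ["positive", "negative", "neutral"]


def build_prompt(text: str) -> str:
    """Instruction prompt asking the LLM for exactly one sentiment label."""
    return (
        "Classify the investor sentiment of the following news headline "
        f"as one of {LABELS}.\n"
        f"Headline: {text}\n"
        "Label:"
    )


def parse_label(llm_answer: str, default: str = "neutral") -> str:
    """Map the LLM's free-text answer onto the fixed label set."""
    answer = llm_answer.strip().lower()
    for label in LABELS:
        if label in answer:
            return label
    return default  # fall back when the answer is unparsable
```

The parsed labels can then serve as training data for a small classifier such as RoBERTa, which is what makes the downstream inference cheap and fast.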