Project repository: Web3 AI trading agent
Understanding model deployment options
Choose your deployment strategy based on requirements
The fine-tuning process produces LoRA (Low-Rank Adaptation) adapters that modify the base model's behavior without altering the original weights. You have three deployment options:
Option 1: Direct LoRA usage
- Pros: Smallest memory footprint, fastest deployment
- Cons: Requires MLX runtime, adapter loading overhead
- Best for: Development, testing, resource-constrained environments
Option 2: Fused model deployment
- Pros: Single model file, no adapter dependencies, consistent performance
- Cons: Larger file size, permanent modification
- Best for: Production deployment, sharing, simplified distribution
Option 3: Ollama deployment
- Pros: Easy API access, model versioning, production-ready serving
- Cons: Additional quantization step, external dependency
- Best for: API-based integration, multi-user access, scalable deployment
If you are still on your learning path, I suggest trying out all three to get a feel for the process and to see the behavioral differences (e.g., inference time). It doesn't take much time; follow the instructions later in this section.
Direct LoRA adapter usage
Direct LoRA usage provides immediate access to your specialized trading model. If you ran a quick test/validation of your fine-tuned model loaded with adapters.safetensors, you have already done this direct LoRA adapter usage test.
Here it is again:
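A minimal sketch using the mlx_lm CLI; the base model name, adapter directory, and prompt are assumptions carried over from the fine-tuning section, so adjust them to your setup:

```bash
# Generate with the base model plus LoRA adapters loaded at runtime
mlx_lm.generate \
  --model Qwen/Qwen2.5-3B-Instruct \
  --adapter-path adapters \
  --prompt "ETH is at 2450 USDC with rising volume. Should I buy or sell?" \
  --max-tokens 100
```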
Model fusion for production deployment
Fusion combines LoRA adapters with base model weights, creating a single model file with embedded trading knowledge.
Understanding the fusion process
LoRA fusion is a technique to directly integrate specialized knowledge from adapter weights into the base model parameters. Mathematically, this involves taking the original Qwen 2.5 3B model parameters and combining them with the adapter's low-rank matrices.
Practically, this is something of a double-edged sword: on the one hand, the model grows from roughly 2 GB (when using adapters separately) to around 6 GB in its fully fused state; on the other hand, inference loading times improve significantly because adapter loading overhead is eliminated. Additionally, the fused model maintains consistent and stable inference speeds without adapter-related delays.
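As a sketch of the arithmetic in standard LoRA notation (the symbols below are the usual conventions, not taken from this text), fusion folds each low-rank update into its base weight matrix:

```latex
% W_0: base weight (d x k); B: (d x r); A: (r x k); r << min(d, k)
% alpha is the LoRA scaling hyperparameter set during fine-tuning
W_{\text{fused}} = W_0 + \frac{\alpha}{r} \, B A
```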
Performing model fusion
Execute fusion with appropriate settings, then verify fusion success by testing the fused model. Both steps are sketched below.
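A hedged sketch of the fusion and verification commands using the mlx_lm CLI; the base model name, adapter directory, output path, and prompt are assumptions, so adapt them to your setup:

```bash
# Fuse the LoRA adapters into the base model weights (assumed paths)
mlx_lm.fuse \
  --model Qwen/Qwen2.5-3B-Instruct \
  --adapter-path adapters \
  --save-path fused_model

# Verify: generate from the fused model; no --adapter-path is needed anymore
mlx_lm.generate \
  --model fused_model \
  --prompt "ETH is at 2450 USDC with rising volume. Should I buy or sell?" \
  --max-tokens 100
```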
Converting to Ollama format
Ollama provides production-grade model serving with API access, and in practice it runs very smoothly. Set up llama.cpp for model conversion (clone and build it next to your project, then return to the project directory with cd ..):
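A sketch of a typical llama.cpp setup; the CMake build flow below is the standard one for current llama.cpp, but verify it against the repository's README:

```bash
# Clone llama.cpp, which provides the GGUF conversion and quantization tools
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# Build the tools (produces binaries such as llama-quantize under build/bin)
cmake -B build
cmake --build build --config Release

# Python dependencies for the conversion script
pip install -r requirements.txt

# Return to the project directory
cd ..
```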
Converting MLX to GGUF format
Run the conversion script, for example:
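A hedged sketch converting the fused model (saved in Hugging Face format by mlx_lm.fuse) to GGUF; the directory and output file names are assumptions:

```bash
# Convert the fused model directory to a 16-bit GGUF file
python llama.cpp/convert_hf_to_gguf.py fused_model \
  --outfile trader-qwen-f16.gguf \
  --outtype f16
```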
Quantizing for efficiency
Optionally, apply quantization for performance, e.g. Q4_K_M (4-bit with the K-quant method); a command sketch follows the table below.

| Format | Size | Speed | Quality | Use Case |
|---|---|---|---|---|
| F16 | 100% | Medium | Highest | Development/Testing |
| Q8_0 | ~50% | Fast | High | Balanced Production |
| Q4_K_M | ~25% | Fastest | Good | Resource Constrained |
| Q2_K | ~12% | Very Fast | Lower | Extreme Efficiency |
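A sketch of the quantization step using llama.cpp's llama-quantize binary; the file names and the binary's location under build/bin are assumptions based on the build above:

```bash
# Produce a Q4_K_M quantized model from the 16-bit GGUF file
./llama.cpp/build/bin/llama-quantize \
  trader-qwen-f16.gguf \
  trader-qwen-q4_k_m.gguf \
  Q4_K_M
```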
Creating Ollama model
Register the quantized model with Ollama using the following instructions. Note that I'm using the trader-qwen:latest model name in these instructions; make sure you change it to your own name if you use a different one.
Create an Ollama Modelfile with the model configuration:
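A minimal Modelfile sketch; the GGUF file name, system prompt, and sampling parameters are assumptions, so adapt them to your model:

```
# Point Ollama at the quantized GGUF file produced above (assumed name)
FROM ./trader-qwen-q4_k_m.gguf

# An illustrative system prompt for the trading use case
SYSTEM """You are a crypto trading assistant. Given market data for ETH/USDC, respond with a trading decision and brief reasoning."""

# Conservative sampling defaults; tune for your workload
PARAMETER temperature 0.7
PARAMETER top_p 0.9
```

Then register the model with `ollama create trader-qwen:latest -f Modelfile`.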
Test Ollama deployment
Test the model with a trading prompt, for example:
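A sketch of two ways to exercise the deployment; the prompt text is illustrative:

```bash
# Interactive test through the Ollama CLI
ollama run trader-qwen:latest "ETH is at 2450 USDC with strong upward momentum. Should I buy or sell?"

# Or hit the local Ollama HTTP API (default port 11434)
curl http://localhost:11434/api/generate -d '{
  "model": "trader-qwen:latest",
  "prompt": "ETH is at 2450 USDC with strong upward momentum. Should I buy or sell?",
  "stream": false
}'
```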
Integrating custom models with trading agents
Update your trading agent configuration to leverage the custom-trained model.
Configuration for Ollama integration
Update config.py for Ollama model usage:
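A hedged sketch of the relevant settings; the variable names below are assumptions rather than the repository's exact keys, so map them onto your config.py:

```python
# Route the agent's inference through Ollama (assumed setting names)
USE_MLX_MODEL = False

# The model registered with Ollama above, and the default local endpoint
OLLAMA_MODEL = "trader-qwen:latest"
OLLAMA_BASE_URL = "http://localhost:11434"
```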
Configuration for direct MLX usage
Alternative: use the MLX adapters directly. MLX-based configuration in config.py:
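Again a sketch with assumed setting names; point the base model and adapter path at the artifacts from the fine-tuning section:

```python
# Run inference locally through MLX with the LoRA adapters (assumed names)
USE_MLX_MODEL = True

MLX_BASE_MODEL = "Qwen/Qwen2.5-3B-Instruct"  # base model used for fine-tuning
MLX_ADAPTER_PATH = "adapters"                # directory containing adapters.safetensors
```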