Wasim A Iqbal, PhD

1.7K posts

@waz_147

Newcastle Upon Tyne, England · Joined August 2011
392 Following · 304 Followers
Pinned Tweet
Wasim A Iqbal, PhD @waz_147
I am excited to share a new paper. It demonstrates that machine learning can predict the kinetics of numerous plant Rubiscos. This will ease bioengineering efforts and may allow species-specific parameterization of global photosynthesis models. 1/5 @JXBot doi.org/10.1093/jxb/er…
Wasim A Iqbal, PhD @waz_147
Alhamdulillah starting my first permanent role tomorrow. Excited to work on projects helping to reduce inequality in Newcastle using A.I.
Wasim A Iqbal, PhD retweeted
Dr. Ryan Thompson @RyanMicroBio
Very happy to have submitted my PhD thesis today; it would not have been possible without the support of my wonderful supervisor @MonteroCalasanz. Likewise, very pleased to have been awarded three months of post-submission funding by @UniofNewcastle to continue my doctoral research.
Wasim A Iqbal, PhD @waz_147
If you summarized the boomers using a single photo
ₕₐₘₚₜₒₙ @hamptonism
The universal approximation theorem states that a feed-forward neural network with a single hidden layer can approximate any continuous function on a compact set to any desired precision.
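A minimal sketch of what this looks like in practice (not part of the original tweet; the hidden width, target function, and training settings are arbitrary illustrative choices): a single hidden layer with a tanh activation fit to a continuous function on a compact interval.

```python
# Illustrative sketch: a one-hidden-layer network approximating sin(x) on the
# compact interval [-pi, pi]. Width, learning rate, and step count are
# arbitrary; widening the hidden layer drives the error down further.
import torch
import torch.nn as nn

torch.manual_seed(0)

x = torch.linspace(-torch.pi, torch.pi, 1024).unsqueeze(1)  # compact set
y = torch.sin(x)                                            # continuous target

model = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

for _ in range(2000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()

print(f"final MSE on the grid: {loss.item():.6f}")
```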
Rohan Paul @rohanpaul_ai
Google's tuning playbook is a really comprehensive guide to systematically maximizing the performance of deep learning models.
Wasim A Iqbal, PhD @waz_147
@predict_addict Pretty shitty move. Maybe release a dissertation against this work instead of publicly ousting someone's hard work.
Valeriy M., PhD, MBA, CQF @predict_addict
Never ask a woman her age, a man his salary, or a Cambridge machine learning department why it wastes taxpayer funds on frameworks that neither work nor scale, like Gaussian processes or Bayesian deep nets. #bayesianism
Cameron R. Wolfe, Ph.D. @cwolferesearch
Recently, I’ve run hundreds of instruction tuning experiments with LoRA/QLoRA, and I wanted to share some (basic) code and findings that might be useful…

The code (see replies) contains an instruction tuning script using LoRA/QLoRA and the Alpaca dataset, as well as evaluation code that uses the test set from Vicuna. The repo contains scripts for both training and observing model output. Most of my experiments were run with Mistral-7B using a 2x3090 GPU workstation (the full training script takes a few hours to complete).

When running instruction tuning experiments with LoRA, I started to observe some practical takeaways that I found to be (relatively) useful and generalizable across several models and datasets…

1. Using too high of a rank for LoRA typically leads to overfitting. A low rank (r=8/16) seems to be sufficient in most cases.
2. Adding dropout to the LoRA adapters didn’t do much to prevent overfitting in my experience.
3. For both LoRA and QLoRA, adding LoRA adapters to all linear layers in the model seemed to yield the best performance.
4. Given a properly tuned learning rate, the best performance was typically achieved using a constant learning rate schedule with a short (e.g., 2%-5% of iterations) warmup period. Using a cosine decay schedule for the learning rate did not improve performance much and led to worse overfitting in certain cases.
5. Adding a small weight decay (e.g., 1e-4) helps with overfitting.
6. Performing two training epochs can yield better performance in certain cases, but going beyond two epochs (e.g., three epochs) nearly always causes overfitting.
7. Sufficiently large batch sizes (e.g., around 64 or 128) are important for training stability (if you don’t have enough GPU memory, just use gradient accumulation!). Batch sizes of 8 or 16 led to chaotic training curves and prevented convergence in some cases.
8. In general, observing model outputs on a variety of evaluation sets (e.g., held-out Alpaca examples, the Vicuna evaluation set, or hand-written prompts) was way more informative than tracking training/evaluation metrics.

One other interesting observation is that finetuning with LoRA (as opposed to QLoRA) is not always simple on consumer GPUs (e.g., 3090s), even with smaller LLMs. When finetuning Mistral-7B on the Alpaca dataset, I had to use a reduced sequence length (64-128 tokens) during training to avoid running out of memory (and I still hit sporadic OOMs). I’m not sure if other packages (e.g., LitGPT) better manage memory, but I was personally surprised that LoRA finetuning was non-trivial for a 7B model in bfloat16 on a 3090 GPU.
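A hypothetical sketch, not the author's actual script, of how takeaways 1-7 might map onto a Hugging Face peft LoraConfig and transformers TrainingArguments; the model name, learning rate, and exact hyperparameter values are illustrative assumptions, and the dataset pipeline and Trainer call are omitted.

```python
# Hypothetical sketch: mapping the takeaways above onto peft/transformers
# configuration objects. All specific values are illustrative assumptions.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", torch_dtype=torch.bfloat16
)

# Takeaways 1-3: low rank, no adapter dropout, adapters on all linear layers.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.0,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)

# Takeaways 4-7: constant LR with a short warmup, small weight decay,
# two epochs, and an effective batch size of ~64 via gradient accumulation.
args = TrainingArguments(
    output_dir="lora-alpaca",
    learning_rate=2e-4,
    lr_scheduler_type="constant_with_warmup",
    warmup_ratio=0.03,
    weight_decay=1e-4,
    num_train_epochs=2,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=8,  # 8 x 8 = effective batch of 64
    bf16=True,
)
```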
Isomorphic Labs @IsomorphicLabs
We're excited to announce #AlphaFold 3 with @GoogleDeepMind in @Nature: our new AI model for predicting biomolecule structures with unprecedented breadth and accuracy. Expanding beyond proteins to tackle DNA, RNA, small molecules to fuel advances in biology & drug design 🧵
Valeriy M., PhD, MBA, CQF @predict_addict
KAN is awesome and works exactly as described in the paper. MLPs struggle to approximate many functions, while KAN by design combines Kolmogorov-Arnold ideas with the best of what MLPs can offer. The result is awesome. Colab in the original post by @milos_ai #KAN
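For illustration only (this is not the paper's pykan implementation): a Kolmogorov-Arnold-style layer puts a learnable univariate function on every edge and sums the results, instead of applying a fixed nonlinearity at every node. The sketch below parameterizes each edge function with a small Fourier basis as a simplification of the learnable B-splines used in the paper.

```python
# Illustrative KAN-style layer: one learnable univariate function per edge,
# parameterized here by Fourier features (an assumption; the paper uses
# learnable B-splines on a grid).
import torch
import torch.nn as nn

class KANLayer(nn.Module):
    def __init__(self, in_dim, out_dim, n_freq=8):
        super().__init__()
        # One set of basis coefficients per (output, input) edge.
        self.coef = nn.Parameter(torch.randn(out_dim, in_dim, 2 * n_freq) * 0.1)
        self.register_buffer("freqs", torch.arange(1, n_freq + 1).float())

    def forward(self, x):                        # x: (batch, in_dim)
        arg = x.unsqueeze(-1) * self.freqs       # (batch, in_dim, n_freq)
        basis = torch.cat([torch.sin(arg), torch.cos(arg)], dim=-1)
        # phi_{o,i}(x_i) = sum_k coef[o, i, k] * basis_k(x_i); then sum over i.
        return torch.einsum("bif,oif->bo", basis, self.coef)

# A two-layer KAN-style network, analogous to width=[2, 5, 1] in the paper's notation.
model = nn.Sequential(KANLayer(2, 5), KANLayer(5, 1))
print(model(torch.rand(32, 2)).shape)  # torch.Size([32, 1])
```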
dr. jack morris @jxmnop
one of the most important things I know about deep learning I learned from this paper: "Pretraining Without Attention"

this is what I found so surprising: these people developed an architecture very different from Transformers called BiGS, spent months and months optimizing it and training different configurations, only to discover that at the same parameter count, a wildly different architecture produces identical performance to transformers

this may imply that as long as there are enough parameters, and things are reasonably well-conditioned (i.e., a decent number of nonlinearities and connections between the pieces), then it really doesn't matter how you arrange them, i.e., any sufficiently good architecture works just fine

i feel there's something really deep here, and we may already be very close to the upper bound of how well we can approximate a given function with a certain amount of compute. so we should spend more time thinking about other questions, such as what that function should actually look like (what data? which objective function?) and how to make it more efficient