Bassel Mabsout

288 posts

@bmabsout

Joined March 2012
172 Following · 62 Followers
Bassel Mabsout retweeted
Pulkit Agrawal @pulkitology·
Eka means unity: "one" in Sanskrit and "first" in Finnish. We're building intelligence for the physical world in its native language: forces.

Until now, robotics faced a tradeoff: generality or speed. The real world requires both. Robotics also faced a data problem. Our Vision-Force-Action (VFA) model, the first of its kind, breaks the generality-speed tradeoff and the data barrier. It's a new foundation uniting performance, generality, and safety for putting capable robots in everyone's hands.

Today, I am excited to share our journey of pushing robots beyond human limits. Today, dexterity becomes scalable. Today, I welcome you to the Era of Eka.

Co-founded with @haarnoja, and so thrilled and grateful to be working with a dream team at @EkaRobotics. Learn more: ekarobotics.com
Bassel Mabsout @bmabsout·
@aramh @asincole Will take a look at the talk! But is the wrapper issue not solved by having functions work on objects that are coercible to the datatype we want to work with? Or do you think coerce is not the right answer here?
Aram Hăvărneanu @aramh·
Because newtypes don't work when the old type is expected. You have to deal with what I call "the wrapper problem": as a programmer you have to juggle these wrappers of wrappers and put them together after taking them apart, instead of just writing code. I spoke at length about this problem in my appearance on youtube.com/watch?v=AfbwP9…
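The friction Aram describes, and the coercion escape hatch Bassel asks about, can be sketched in Python (an illustrative analogy of mine, not either author's example; `UserId`, `Meters`, and the helper functions are made-up names): an erased `typing.NewType` lets base-type functions accept the wrapped value for free, while a real runtime wrapper forces the take-apart/put-together dance.

```python
from dataclasses import dataclass
from typing import NewType

# Erased wrapper: a UserId *is* an int at runtime, so functions written
# against int accept it with no unwrapping -- coercion comes for free.
UserId = NewType("UserId", int)

def double(n: int) -> int:
    return n * 2

print(double(UserId(21)))  # 42

# Runtime wrapper: reusing plain numeric functions now means juggling --
# taking the wrapper apart and putting it back together at every call.
@dataclass(frozen=True)
class Meters:
    value: float

def add_meters(a: Meters, b: Meters) -> Meters:
    return Meters(a.value + b.value)  # unwrap, combine, rewrap

print(add_meters(Meters(1.5), Meters(2.5)))  # Meters(value=4.0)
```

The erased variant is roughly what Haskell's coerce gives you at zero cost; the runtime variant is the "wrappers of wrappers" case.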
Bassel Mabsout @bmabsout·
@aramh @asincole What's your opinion, then, on newtypes + newtype deriving and DerivingVia? Since the name of a datatype becomes the name of its unique instances, why isn't creating newtypes the right way to do modularization?
Aram Hăvărneanu @aramh·
Traits in Rust and type classes in Haskell are canonical: a type (or several) can only implement a trait (or class) in one way. This is global, hence anti-modular. In Lean and Agda, type classes or instance arguments are not global. You can have multiple, named implementations, and resolution happens in a scope, not globally. I prefer this much more, but I don't like it either, because the resolution mechanism is too difficult to predict and control. In OCaml, with modules, the best part is that everything is explicit (but that is also the worst part).
Bassel Mabsout @bmabsout·
@tritlo @ppavel24 What's the problem with letting it run until some configurable recursion limit? Even Java's silly type system is undecidable. If I mostly write types that can be inferred by Hindley-Milner, then it shouldn't hit those limits, right?
Matti Palli 🧙‍♂️
@ppavel24 inference in full dependent types is undecidable! They’re too expressive, so you run into the halting problem
Bassel Mabsout @bmabsout·
@HSVSphere Please tell me you're avoiding the million notions of overriding that Nix has. Also, it sounds like Nickel has similar goals in general; do you know how the languages compare?
HSVSphere @HSVSphere·
The ideals of Nix are almost perfect. The implementation sucks, and it's only really usable if you know why it is the way it is (and why it's badly implemented). That's why I'm working on Cab, and I'm taking my time thinking everything through before committing to an implementation.

It's not exactly a build language either. It doesn't specialize in any concept such as derivations, units, resources, or whatever. It only lets you compose expressions with contexts, the stuff that makes Nix magical in the first place.

What are expression contexts? They're when a subset of an expression implicitly carries the whole expression with it, as a context. It's how you don't explicitly specify what a Nix derivation depends on: it just works!

The problem with Nix, ignoring all the QoL stuff that's missing [1], is that Nix contexts can *only* be used for derivations. They're not generic. derivationStrict is a builtin function; you cannot emulate it in the language itself. That prevents you from using contextful expressions for other things, such as process management, or resource (like Terraform) management. Literally anything that forms a graph cannot be done cleanly! This is why a generic contextful-expression language is required.

Cab will fix this, and that won't be the only thing it will fix. Cab has structural types enforced with a super novel system that works super well for a dynamic build-system language. It has patterns, rather than hard-coding for identifiers. I'd argue that its type system is going to be more powerful than TypeScript's, since there is no runtime-comptime difference (it's all the same; there's no IO either, so it's simple).

Anyway, stay tuned for an MVP. I also plan on supporting using the Nixpkgs package set with the system I'm going to build on top of Cab (the Cull Build System), using the efforts the Ekala project has been going through, eventually.

However, I won't announce it until I have a working LSP, a unified documentation system that's flexible, a properly designed, hack-free "project" abstraction for the Cull Build System, and a fast runtime overall (bye bye, NixOS module system and home-manager eval times).

[1] Flakes are a bad solution to the purity problem, nix-* commands shouldn't exist, FODs are hard to use, nixConfig can pwn you, too many tunables, too little separation of concerns between the distro, the module system, Nix itself, and the nix daemon, literal oddities in Nix, much much much more.
Dillon Mulroy @dillon_mulroy (quoted):
yup, decided. i'm ditching nix. i still think that there is no better solution, but i only use a small portion of it and when it breaks it's too much work to justify
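HSVSphere's "expression contexts" can be sketched generically in Python (a toy analogy of my own, not Cab or Nix's actual implementation; `Ctx` and `artifact` are invented names): a value implicitly carries the set of artifacts it references, so composing values composes the dependency graph without ever declaring an edge.

```python
from dataclasses import dataclass, field

# A "contextful" string: the text plus the set of artifacts it mentions.
@dataclass(frozen=True)
class Ctx:
    text: str
    deps: frozenset = field(default_factory=frozenset)

    def __add__(self, other: "Ctx") -> "Ctx":
        # Composition unions contexts -- dependencies propagate implicitly.
        return Ctx(self.text + other.text, self.deps | other.deps)

def artifact(name: str) -> Ctx:
    # Referencing an artifact puts it into the context of the result.
    return Ctx(f"/store/{name}", frozenset({name}))

script = Ctx("exec ") + artifact("python") + Ctx(" ") + artifact("app.py")
print(script.deps)  # the graph edges fell out of ordinary composition
```

Nix restricts this mechanism to strings feeding derivations; the point of making it generic is that the same propagation works for any graph-shaped problem (processes, cloud resources, and so on).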

Bassel Mabsout @bmabsout·
@keenanisalive I can see how the cooling schedule captures the essence of simulated annealing, but it's supposed to be a gradientless metaheuristic. Is there some more general definition of simulated annealing that can be considered the global optimizer in conjunction with some local optimizer?
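One way to frame the question, as a minimal sketch (my own toy, with made-up functions and hyperparameters): the general form of simulated annealing is a Metropolis acceptance rule over an arbitrary neighborhood proposal, with the cooling schedule interpolating from global exploration to greedy local search, and no gradients anywhere.

```python
import math
import random

def anneal(f, x0, neighbor, t0=2.0, cooling=0.999, steps=5000, seed=0):
    """Gradient-free simulated annealing: always accept improvements,
    accept worsening moves with probability exp(-delta / T), and let a
    geometric cooling schedule shrink T from global to local search."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best, fbest = x, fx
    t = t0
    for _ in range(steps):
        y = neighbor(x, rng)
        fy = f(y)
        delta = fy - fx
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            x, fx = y, fy
            if fx < fbest:
                best, fbest = x, fx
        t *= cooling  # the cooling schedule
    return best, fbest

# A bumpy 1-D objective; only function values are available, no gradients.
f = lambda x: (x - 1) ** 2 + 2 * math.sin(5 * x)
best, fbest = anneal(f, x0=5.0, neighbor=lambda x, r: x + r.gauss(0, 0.5))
```

Replacing the single `neighbor` proposal with a full local optimizer recovers the "global wrapper around a local optimizer" reading of the question; basin hopping is the standard name for that variant.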
Keenan Crane @keenanisalive·
Mental map of Markov Chain Monte Carlo (MCMC) algorithms, and analogous machine learning (ML) algorithms [dashed = especially loose analogy]. Grey boxes are basic tools, and each arrow is annotated with the "delta" between algorithms.
Bassel Mabsout @bmabsout·
@n1mas_ @jsuarez Traditionally true, but with REDQ and CrossQ, and soon AQS ;), this is changing. I can get a hopper hopping within 20,000 samples, which is a far cry from the usual 1e6 samples this takes. Joseph's work makes iterating on the method faster, which is the bottleneck in such research.
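The sample-efficiency trick behind REDQ-style methods can be caricatured in a few lines (a toy of my own, nothing like the real algorithms): raise the update-to-data ratio, i.e. reuse each expensive environment sample for many cheap updates from a replay buffer.

```python
import random

# Toy caricature of the update-to-data (UTD) idea behind sample-efficient
# RL methods such as REDQ/CrossQ: reuse every environment sample for many
# cheap updates. The "critic" here is just a scalar mean estimate.
def train(utd_ratio: int, env_steps: int = 20, lr: float = 0.05, seed: int = 0):
    rng = random.Random(seed)
    q = 0.0          # scalar "critic": running estimate of E[reward]
    replay = []      # replay buffer of observed rewards
    for _ in range(env_steps):
        reward = rng.gauss(1.0, 0.1)   # one expensive environment interaction
        replay.append(reward)
        for _ in range(utd_ratio):     # utd_ratio updates per interaction
            r = rng.choice(replay)
            q += lr * (r - q)          # TD-style update toward the sample
    return q

# Same 20 environment samples; higher UTD extracts far more learning.
print(train(utd_ratio=1), train(utd_ratio=20))
```

The real methods need ensembles or normalization tricks to keep high UTD stable; this sketch only shows why reusing samples moves the sample-count bottleneck.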
Joseph Suarez 🐡 @jsuarez·
RL is useless... except if you want super-human perf on games, control, LLMs, chip design, rideshare matching, 5G, and more! It's also an area where you can make major progress with very few resources. Join PufferAI's open-source efforts at discord.gg/puffer or DM me!
Simo Ryu @cloneofsimo·
Am I the only one who finds it so weird that a lot of successful RL is based on evolutionary + gradient hybrids, but so much of deep learning optimization is purely greedy? Can anyone pinpoint exactly why this is, and how we can leverage more zeroth-order algorithms? btw we have stuff like this that might be a middle ground: arxiv.org/abs/1907.08610. @kellerjordan0 apparently had a good shot with this; irreplaceable with other optimizers in practice on his fastest CIFAR trainer. Food for thought, man. Everything points to AdamW being messed up, so why are we stuck with this shit? (btw our lab doesn't use an evolutionary hybrid, just PPO)
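The arXiv link Simo mentions (1907.08610) is the Lookahead optimizer; its mechanism can be sketched in a scalar toy (my own heavy simplification, not the paper's code): any inner optimizer takes k fast greedy steps, then "slow" weights interpolate part of the way toward where the fast weights ended, a mild middle ground between pure greedy descent and an outer search loop.

```python
# Scalar sketch of a Lookahead-style optimizer (after arXiv:1907.08610,
# heavily simplified): k fast inner SGD steps, then the slow weights move
# a fraction alpha toward the fast weights.
def lookahead_minimize(grad, w0, inner_lr=0.1, alpha=0.5, k=5, outer_steps=50):
    slow = w0
    for _ in range(outer_steps):
        fast = slow
        for _ in range(k):                 # k greedy SGD steps (fast weights)
            fast -= inner_lr * grad(fast)
        slow += alpha * (fast - slow)      # slow-weight interpolation
    return slow

# Minimize f(w) = (w - 3)^2, whose gradient is 2*(w - 3).
w = lookahead_minimize(grad=lambda w: 2 * (w - 3), w0=0.0)
print(w)  # converges to ~3.0
```

The slow/fast split is what gives it a faint population flavor: the slow weights average over short greedy trajectories instead of following any single one.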
Bassel Mabsout @bmabsout·
@emil_priver @ryanwinchester Wait, what? What happens if you're merging two histories with different notes? Does it do a note merge conflict, or does it just take the one you're merging from, or something?
Emil Privér @emil_priver·
@ryanwinchester So 1 reason to use this is that you can update a note without rewriting history
Emil Privér @emil_priver·
Did you know that you could add notes to your git commits?
Emil Privér tweet media

Bassel Mabsout @bmabsout·
@VictorTaelin @Marc_Compere I love your approach. For scaling to harder problems, have you thought about how to know which parts of the function space to "focus" on? Gradient descent gets to know which part of the function is more important to change; I feel this is important for efficient search!
Taelin @VictorTaelin·
The whole point of gradient descent is that it is a fast way to find functions. That's all. The problem is that, to use it, we must accept the limitations of the underlying architecture.

Attention is nothing but a terrible programming language, where the only primitive is querying a neural dictionary. Imagine implementing a website using a Python where the only structure is a neural dictionary. It would be... hard. And that's the language GPTs have to work with! So, while GD is great at finding functions, it finds them in a crappy programming language that is, in turn, limited in many ways. My hypothesis is that all the well-known limitations of current AI models are inherited from this lack of expressivity of attention.

Now, what if we could search functions in a real programming language, as fast as GD finds them under attention? IF that was the case, then transformers would be entirely obsolete, and we could train a model capable of doing all the things that current LLMs are notoriously bad at.

Just to be clear, I'm not claiming that what I posted is that. But it SEEMS to be the optimal way to find unknown functions (in a very deep theoretical sense), and, because of that, I suspect it could play a role in a different AI architecture that isn't restricted or bound by the limitations of attention. And that could result in more competent models, in the sense they'd be able to comfortably do things that GPTs struggle with.
Taelin @VictorTaelin·
THE ALGORITHM IS COMPLETE 🥹

Finding XOR-XNOR:
Haskell: 2.8s
HVM: 0.0085s

Based on the following tests:
f(00100011) = 1011
f(10111001) = 0100

Solving for 'f' by search, we find:
xor_xnor (0:0:xs) = 0 : 1 : xor_xnor xs
xor_xnor (0:1:xs) = 1 : 0 : xor_xnor xs
xor_xnor (1:0:xs) = 1 : 0 : xor_xnor xs
xor_xnor (1:1:xs) = 0 : 1 : xor_xnor xs

My best Haskell searcher, using the Omega Monad, takes 47m guesses. Meanwhile, the HVM searcher, using SUP Nodes, takes just 1.7m interactions, or 0.03 interactions per guess (!!!). This sounds too good to be true, so, before getting too excited, keep in mind *it is very very likely I'm doing something dumb*. As such, I request validation. FP nerds: prove me wrong? (pls)

I've published the Haskell code (and the full story, for those interested) below. Am I missing something? Is there some obvious way to optimize this Haskell search without changing the algorithm? If so, I'd love to hear it. Better embarrassed than pursuing the wrong idea 😅

Gist: gist.github.com/VictorTaelin/7…
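For a feel of what "solving for f by search" means, here is a brute-force toy of my own (with none of HVM's SUP-node sharing), reading each of Taelin's examples as mapping input bit pairs to single output bits and enumerating all 16 possible truth tables:

```python
from itertools import product

# Brute-force program search over the smallest possible space: every
# function from a bit pair to one output bit (2^4 = 16 truth tables),
# keeping those consistent with the two input/output examples.
examples = [("00100011", "1011"), ("10111001", "0100")]

def apply_table(table, bits):
    pairs = [(int(bits[i]), int(bits[i + 1])) for i in range(0, len(bits), 2)]
    return "".join(str(table[p]) for p in pairs)

solutions = []
for outs in product((0, 1), repeat=4):
    table = dict(zip([(0, 0), (0, 1), (1, 0), (1, 1)], outs))
    if all(apply_table(table, i) == o for i, o in examples):
        solutions.append(table)

print(solutions)  # the unique survivor is XNOR: 1 exactly when bits agree
```

The interesting part of Taelin's claim is not the enumeration itself but the cost per guess: naive search pays full price for every candidate, while the HVM approach shares work across candidates.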
Taelin @VictorTaelin·
@bmabsout that's the dream! but the tech is not there yet. I don't think we'll reach a huge audience anytime soon, but we must start somewhere. running high-level langs on GPUs is *hard*. with this release, we finally have something stable enough to be used. it is the very first step!
Taelin @VictorTaelin·
so, when we get HOC's @ back, I'll make a proper post, but this might take 72h. for now, I'll nonchalantly announce that higherorderco dot com is up. years of research to put python inside gpus, and, there it is. or something kinda like it. I'll rest for now. see you soon 🥳
Bassel Mabsout @bmabsout·
@AndreaVicere @getjonwithit I think you're confusing ball and sphere: "sphere" usually refers to the surface itself, while "ball" refers to the space inside a sphere. So the boundary of a 3D ball is a 2D spherical surface.
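The standard definitions, for reference (my notation):

```latex
B^{n} = \{\, x \in \mathbb{R}^{n} : \lVert x \rVert \le 1 \,\}
\qquad
S^{n-1} = \partial B^{n} = \{\, x \in \mathbb{R}^{n} : \lVert x \rVert = 1 \,\}
```

So the boundary of the 3-ball $B^3$ is the 2-sphere $S^2$, which is Bassel's point.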
Jonathan Gorard @getjonwithit·
"The boundary of a boundary is always empty." A huge amount of (classical) physics, including much of general relativity and electromagnetism, can be deduced directly from this simple mathematical fact. Yet, on the surface, it doesn't seem to have much to do with physics. (1/10)
Jonathan Gorard tweet media

Bassel Mabsout @bmabsout·
@VictorTaelin I do believe in evolution + grad descent though; I think their combination is more powerful than either alone
Taelin @VictorTaelin·
@bmabsout It is not efficient if the space of effective programs looks like this
Taelin tweet media

Taelin @VictorTaelin·
Guys I must be actually insane because I non-ironically think HOC will have AGI before OAI, and for reasons that seem so obvious to me? As in, if you throw a rock up, the rock will fall down. If you simulate evolution selecting for intelligence, with mass compute... 🧐
Bassel Mabsout @bmabsout·
@VictorTaelin True, that's what the loss space looks like if you're looking for one effective program, but not if you have 10,000 programs and you're finding a function that makes programs that slightly improve on the total effectiveness
Bassel Mabsout @bmabsout·
@VictorTaelin Gradient descent is very efficient though: it allows us to at least locally attribute how every variable in the system contributes to solving the task at hand. I can't just know which lambda term to change to produce a better output. Also, memetic algorithms are evolution + grad descent
Taelin @VictorTaelin·
just to be clear, I'm not saying I'm better than anyone; just that it makes so much logical sense that evolving intelligence by iterating small λ-terms would be much more efficient than the slow-gradient-descent-over-colossal-matrices approach. I just wonder where this is wrong
Bassel Mabsout @bmabsout·
@ereb0s_labs Might be the case that eventually GPT-2 will converge to a lower loss, but it sounds like you've got a paper on your hands! It might also be a function of hyperparameter tuning, since you've probably trained your proposed model multiple times on this dataset as you iterated
ereb0s @ereb0s_labs·
I'm not sure I have enough reach with this account to get a proper response. I need help understanding if I can trust the results I'm seeing for a custom LLM model I've been researching.
Yellow - GPT2
Purple - custom model
More info below.
ereb0s tweet media

Bassel Mabsout @bmabsout·
@cs_kaplan @akivaw @jrdnfrd @keenanisalive Unless you allow for an infinite number of tiles, I don't see how a finite subsection can ever encode an infinite coordinate. Isn't it guaranteed that the number of possible configurations of a finite number of tiles in a finitely sized region is finite?
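Bassel's counting argument in one line (a toy upper bound of my own; real tilings have geometric constraints that only shrink the count): with a finite number of tile states per cell and a finite region, a patch admits finitely many configurations, so a bounded patch can distinguish only finitely many locations.

```python
# Upper bound on distinct local patterns: each of `cells` positions holds
# one of `tile_states` choices (tile type or orientation), so a finite
# patch has at most tile_states ** cells configurations -- finite, hence
# unable to encode an unbounded coordinate by itself.
def max_configurations(tile_states: int, cells: int) -> int:
    return tile_states ** cells

# Illustrative numbers (my assumption): a single tile shape placed in one
# of 12 orientations (6 rotations x 2 reflections), over a 9-cell patch.
print(max_configurations(tile_states=12, cells=9))
```

The aperiodic-tiling claim is weaker than "encoding an infinite coordinate": larger and larger patches reveal more and more positional information, even though any fixed-size patch reveals only a finite amount.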
Keenan Crane @keenanisalive·
Pretty awesome discovery: a single shape that tiles the infinite plane without repetition. If you're staring straight down at a checkerboard, there's no way to tell where you are: every part looks the same. But here, the relative arrangement of tiles encodes your location.