The only comment I'd add is that limitation #2 (The Gullible Memory risk) isn't just about adversarial inputs. It also makes this approach somewhere between less than ideal and unusable for any inputs which change over time. Which, fortunately or unfortunately, a very large amount of information does.
For example, if it learns about a business's core products & services, and the business discontinues some products and focuses on others, the memory seems likely to continue to retrieve old information. And since that old information was likely built up over time, the old may very well drown out the new. At a minimum, this would dilute newer, more accurate information; at a maximum it would in some sense hallucinate about the past - even though it has information about the present in its memory as well.
So, eventually, memory architectures will need some mechanism to either automatically purge memories that are no longer relevant, or assign and adjust priorities to different memories over time. If I tell a human that our strategy has changed, they handle this automatically - they understand both that there's a current state and a historical state (both of which are valuable to remember, but for different reasons). LLM memory will eventually need to figure out how to do this as well.
Quick thought: What if the information placed into memory was timestamped, and when the memory contains multiple conflicting hits on a topic, it automatically prioritizes the most recent information over the past information? This seems like it would more closely match how we humans handle changing information over time - and might also provide a way to more easily counter adversarial information planted in the past.
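To make the timestamp idea concrete, here's a minimal sketch of what a recency-prioritized store could look like. Everything here (the class names, the topic-keyed lookup, the newest-wins rule) is hypothetical illustration, not anything from the article:

```python
import time
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    topic: str
    text: str
    timestamp: float  # seconds since epoch


class TimestampedMemory:
    """Toy store: every write is timestamped; on conflicting hits
    for a topic, the most recent entry wins, while older entries
    remain available as the 'historical state'."""

    def __init__(self):
        self.entries: list[MemoryEntry] = []

    def write(self, topic, text, timestamp=None):
        # Stamp each memory at write time so later conflicts can be ordered.
        self.entries.append(MemoryEntry(topic, text, timestamp or time.time()))

    def retrieve(self, topic):
        # Current state: newest entry wins, so stale facts accumulated
        # over time no longer drown out the latest update.
        hits = [e for e in self.entries if e.topic == topic]
        return max(hits, key=lambda e: e.timestamp).text if hits else None

    def history(self, topic):
        # Historical state, oldest first -- kept, but only on request.
        hits = [e for e in self.entries if e.topic == topic]
        return [e.text for e in sorted(hits, key=lambda e: e.timestamp)]
```

Usage, mirroring the discontinued-product example above: write "focus on product A" at t=1, then "product A discontinued; focus on B" at t=2, and `retrieve()` returns only the latter while `history()` still shows both. A real system would need fuzzy topic matching and conflict detection rather than exact keys, but the newest-wins tiebreak is the core of the idea.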
Hi Maria, thank you for sharing this work — it is one of the most practical write-ups I’ve seen on grafted neural memory for frozen LLMs.
I tried to reproduce a similar “grafted HOPE/Titans-style” setup in my own project, but my results are currently far from stable. If you’re open to sharing more details, I would really appreciate guidance on a few implementation points:
1) Where exactly do you inject the memory stream (before/after token embeddings, and at which layer boundaries)?
2) How is your cross-attention gate parameterized and trained (loss terms, scaling, regularization)?
3) What optimizer/lr schedule worked for memory + gate training with a frozen backbone?
4) In the 2-turn protocol, how do you handle memory updates between turns (frequency, reset policy, truncation/compression)?
5) Do you have any practical tricks to prevent the base model from ignoring memory early in training?
Also, is there already a public repo (or planned timeline) for code/weights release? Even a minimal reference implementation would be extremely helpful for reproducibility.
Thanks again for publishing this openly — it’s genuinely inspiring work.
This looks excellent.
Nice work Maria! Looking forward to the full release, thank you for the basic summary first.
> a lot of providers that announce millions token context windows cap it to no more than 100k
...wait, what?!
Yep, here for example
https://support.google.com/gemini/answer/16275805
Curious what the M4 Mac Mini was used for, and in what configuration? Pro? RAM? Storage?