Jordan
@ZedDou1
🥦

454 posts
Joined May 2013
170 Following · 18 Followers
Jordan
Jordan@ZedDou1·
@JinaAI_ @counting_words Therefore, shouldn't the whole approach work even without actually injecting the watermarks, since we should have some degree of semantic preservation between the original, watermarked, and paraphrased text, no matter whether we paraphrase the original or the watermarked text?
Jina AI
Jina AI@JinaAI_·
@counting_words it's pretty much like the original input. you can find some examples in the blog post.
Jina AI
Jina AI@JinaAI_·
Text watermarking using — surprise — embedding models?! And those watermarks persist after paraphrasing & translation—one of the most "out-of-domain" usages of embeddings we learned at EMNLP2024. It leverages the long-context and cross-lingual features of jina-embeddings-v3 to create a robust watermark system. But first, what is a good text watermark?
Jordan
Jordan@ZedDou1·
@TejasReddyS1 @victorialslocum @AnthropicAI Basically, your first request that includes the whole document will take time, but once it's done, the key and value vectors of your document can be cached and won't need to be re-computed. You only have to process the extra text in your prompt (e.g., the chunk here)
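Jordan's point can be sketched with a toy prefix cache. The real mechanism caches per-layer key/value tensors inside the model; `process_tokens`, `run_prompt`, and the dict-based cache here are all illustrative stand-ins, not any provider's API:

```python
def process_tokens(tokens):
    # Stand-in for the expensive attention computation over `tokens`.
    return [f"kv({t})" for t in tokens]

cache = {}  # document prefix (as a tuple) -> its precomputed "KV vectors"

def run_prompt(document_tokens, extra_tokens):
    """Reuse cached KV vectors for the document prefix; only the extra
    text (e.g. the chunk) is processed on subsequent requests."""
    key = tuple(document_tokens)
    if key not in cache:
        cache[key] = process_tokens(document_tokens)  # paid once, first request
        fresh = len(document_tokens) + len(extra_tokens)
    else:
        fresh = len(extra_tokens)  # cache hit: only the chunk is processed
    return cache[key] + process_tokens(extra_tokens), fresh

doc = ["long", "document", "tokens"]
_, first_cost = run_prompt(doc, ["chunk", "1"])
_, second_cost = run_prompt(doc, ["chunk", "2"])
print(first_cost, second_cost)  # second request only pays for the chunk
```

After the first request the document's "KV vectors" are reused, so every later chunk only pays for its own tokens.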
Victoria Slocum
Victoria Slocum@victorialslocum·
Contextual Retrieval is a new RAG method introduced by @AnthropicAI, where context is automatically inserted into each chunk by an LLM.

A naive chunking method will create a vector embedding independently for each chunk, and then RAG systems use these embeddings to find chunks matching the query. But there's an issue, lost context, where chunks lose the context of the original document.

Anthropic's method, Contextual Retrieval, is a brute-force strategy to combat the issue of lost context:
1. Each chunk is sent to the LLM alongside its full document 📃
2. An LLM adds relevant context to every chunk 🧩
3. This results in richer and more informative embeddings 🔎

@JinaAI_ developed a separate strategy, late chunking, which is a different way of handling lost context. In late chunking, the full document is first embedded and then chunked afterwards, so that full-document context is included in the chunk embeddings.

Both methods solve this problem, but how do they stack up?
💻 Contextual retrieval requires running an LLM for each chunk (expensive)
📊 Both methods can include the missing context in the chunk embeddings

While contextual retrieval could cost a lot, you could use a smaller LM for this task, or use Anthropic's new prompt caching feature to bring these costs down.

Take a look at the similarity scores below for a basic example. Both late chunking and contextual retrieval perform similarly and improve on naive chunking!

This method is still young, so we still don't know what will work best in a real-world RAG application 👀

See it in action in this notebook by @DanielW966: github.com/weaviate/recip…
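The late-chunking side of the comparison can be sketched like this. The toy `token_embeddings` function stands in for a real long-context encoder (which would return contextualized per-token vectors); only the pooling step reflects the actual technique:

```python
def token_embeddings(document):
    # Stand-in for a long-context encoder: real late chunking uses the
    # contextualized per-token outputs over the FULL document; here each
    # "token vector" is just [token length, token position].
    toks = document.split()
    return toks, [[float(len(t)), float(i)] for i, t in enumerate(toks)]

def late_chunk(document, chunk_spans):
    """Embed the whole document first, then mean-pool token vectors per
    chunk span, so every chunk vector was computed with full-document
    context rather than in isolation."""
    _, vecs = token_embeddings(document)
    pooled = []
    for start, end in chunk_spans:
        span = vecs[start:end]
        pooled.append([sum(col) / len(span) for col in zip(*span)])
    return pooled

chunk_vecs = late_chunk("a bb ccc dddd", [(0, 2), (2, 4)])
print(chunk_vecs)  # one pooled vector per chunk
```

The key design point is the order of operations: encode first, chunk afterwards, which is what lets chunk embeddings keep document-level context without an extra LLM call per chunk.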
Jordan
Jordan@ZedDou1·
@TejasReddyS1 @victorialslocum @AnthropicAI To me, some local context would be enough for that (while feeding less noise into the context generation), but then you wouldn't be able to fully exploit prompt caching. To be honest, the presented approach feels like an ad for it
Jordan
Jordan@ZedDou1·
@TejasReddyS1 @victorialslocum @AnthropicAI From my understanding, that's a limitation of the presented approach. However, I doubt the usefulness of feeding something like a 100-page PDF just to prepend 50-100 tokens of context to the chunk.
Jordan
Jordan@ZedDou1·
@michael_g_u @arankomatsuzaki Thanks for the answer! I understand the explanation and it definitely makes sense to me, but I'm still struggling with the notion of the right/left side. To me, we usually compare a query to a piece of text, and there's no notion of positioning; that's why I'm struggling to understand
Michael Günther
Michael Günther@michael_g_u·
@ZedDou1 @arankomatsuzaki The similarity is symmetric but the loss is not. You always compare the similarity of a text from the left side to the similarities of all texts on the right side. Bidirectional means that we also compare the right sides to all left sides, so you have negatives on both sides.
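One way to read Michael's explanation is as a plain-Python InfoNCE sketch. The matrix values are made up, and this is a toy reading of the idea, not the model's actual training code:

```python
import math

def directional_loss(sim):
    """InfoNCE over rows: each left-side text i is compared against ALL
    right-side texts (row i of the queries-by-documents similarity matrix);
    the matching pair sits on the diagonal."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        denom = sum(math.exp(s) for s in sim[i])
        total -= math.log(math.exp(sim[i][i]) / denom)
    return total / n

def bidirectional_loss(sim):
    # "Bidirectional": also compare right sides to all left sides, i.e.
    # transpose and average, so there are negatives on both sides.
    sim_t = [list(col) for col in zip(*sim)]
    return 0.5 * (directional_loss(sim) + directional_loss(sim_t))

# sim(a, b) for one PAIR is symmetric, but the queries-by-documents matrix
# is generally not, so the two directions give different loss values.
sim = [[0.9, 0.2],
       [0.6, 0.8]]
print(directional_loss(sim), bidirectional_loss(sim))
```

This shows why the similarity function can be symmetric while the loss is not: the softmax normalization runs along one axis at a time.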
Aran Komatsuzaki
Aran Komatsuzaki@arankomatsuzaki·
jina-embeddings-v3: Multilingual Embeddings With Task LoRA - Achieves SotA perf on multilingual and long-context retrieval - Outperforms proprietary models on English - Up to 8k context length hf: huggingface.co/jinaai/jina-em… abs: arxiv.org/abs/2409.10173 Edit: Wait for the hf page to be updated soon
Jordan
Jordan@ZedDou1·
@tonywu_71 @bclavie Just curious whether you tried the 3 ways of selecting the tokens to pool as mentioned in the blog? And if so did you also observe that hierarchical clustering for selecting tokens works better? In any case, well done! 😁
Tony Wu
Tony Wu@tonywu_71·
✨Remember @bclavie's awesome blog post on Token Pooling for ColBERT? Ofc we had to try it on ColPali, and the results are exciting: we managed to reduce the total number of vectors in document embeddings by ≈66.7% while retaining ≈97.8% of the original performance! 🧵(1/n)
Ben Clavié@bclavie

🥁🥁 New blog post out (link in thread), w/ two aims: 🤓 Providing a clear, hopefully easy-to-read intro to ColBERT, without assuming you've ever used it. 🏊Introducing ColBERT Token Pooling ✨: You can reduce the size of ColBERT indexes by 66% with barely any performance hit!
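A minimal sketch of the simplest of the pooling strategies discussed in the thread: mean-pooling consecutive token vectors. Per Jordan's question, the blog also covers cleverer token selection (hierarchical clustering reportedly works better); this toy only illustrates the mechanics, with made-up 2-d vectors:

```python
def sequential_pool(token_vecs, pool_factor=3):
    """Mean-pool every `pool_factor` consecutive token vectors, shrinking a
    ColBERT-style multi-vector representation by about (1 - 1/pool_factor),
    e.g. ~66% for pool_factor=3."""
    pooled = []
    for i in range(0, len(token_vecs), pool_factor):
        group = token_vecs[i:i + pool_factor]
        pooled.append([sum(col) / len(group) for col in zip(*group)])
    return pooled

vecs = [[float(i), float(i % 2)] for i in range(9)]  # 9 toy token vectors
compressed = sequential_pool(vecs, pool_factor=3)
print(len(compressed))  # 3 vectors instead of 9
```

The compression ratio reported in the thread (~66.7% fewer vectors) corresponds to a pool factor of 3.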

Jordan
Jordan@ZedDou1·
@diervo @sbergman @omarsar0 @simonw From my understanding, they simply use it as an encoder for the query/chunks and compute the regular cosine similarity between the resulting embeddings to get the top-k chunks
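The encoder-plus-cosine retrieval Jordan describes, as a toy sketch with made-up 2-d embeddings standing in for real encoder outputs:

```python
import math

def cosine(a, b):
    # Cosine similarity between two dense vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, chunk_vecs, k=2):
    # Rank chunk indices by cosine similarity to the query embedding.
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]

q = [1.0, 0.0]                                  # toy query embedding
chunks = [[0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]   # toy chunk embeddings
print(top_k(q, chunks, k=2))  # → [0, 2]
```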
Diego Ferreiro Val
Diego Ferreiro Val@diervo·
Google now allows you to cache immutable parts of the prompt (imagine the set of articles only changes so often); I assume they still charge you, though. What I don't understand from this paper is how this dragon retriever works: is it used to create the embeddings first and then for retrieval?
elvis
elvis@omarsar0·
Very interesting study comparing RAG and long-context LLMs. Main findings:
- long-context LLMs outperform RAG on average performance
- RAG is significantly less expensive

On top of this, they also propose Self-Route, leveraging self-reflection to route queries to RAG or LC. They report that Self-Route significantly reduces computational cost while maintaining comparable performance to LC.

Interesting result: "On average, LC surpasses RAG by 7.6% for Gemini-1.5-Pro, 13.1% for GPT-4O, and 3.6% for GPT-3.5-Turbo. Noticeably, the performance gap is more significant for the more recent models (GPT-4O and Gemini-1.5-Pro) compared to GPT-3.5-Turbo, highlighting the exceptional long-context understanding capacity of the latest LLMs."

Again, not sure why Claude was left out of the analysis. I would love to see that, including other custom LLMs trained to perform better at RAG.

I am not entirely convinced that long-context LLMs can generally outdo RAG systems today. But I think it's interesting to see a combination of the approaches, which is something I've been advocating for recently.
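The Self-Route idea can be sketched as follows. `llm` is a hypothetical stand-in model (it just pattern-matches on the prompt), and the prompt wording is illustrative, not the paper's:

```python
def llm(prompt):
    # Stand-in model: "declines" whenever the supplied chunks look bad.
    if "INSUFFICIENT" in prompt:
        return "unanswerable"
    return "answer from chunks"

def self_route(query, chunks, full_document):
    """First try the cheap RAG path, letting the model reflect on whether
    the retrieved chunks suffice; fall back to the expensive long-context
    path only when it declines."""
    rag_prompt = (f"Answer from these chunks, or reply 'unanswerable':\n"
                  f"{chunks}\nQ: {query}")
    rag_answer = llm(rag_prompt)
    if rag_answer != "unanswerable":
        return rag_answer, "rag"               # cheap path taken
    lc_prompt = f"{full_document}\nQ: {query}"
    return llm(lc_prompt), "long-context"      # expensive fallback

ans, route = self_route("q", "good chunk", "whole doc")
print(route)  # routes to RAG when the chunks suffice
```

Most queries take the cheap branch, which is where the reported cost savings come from.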
Jordan
Jordan@ZedDou1·
@tonywu_71 Thank you for the clear answers and the amazing work! Really like the approach and especially the explainability part! Keep up the good work! 🚀
Tony Wu
Tony Wu@tonywu_71·
@ZedDou1 However, I personally really like the idea of running VQA directly from ColPali's document embeddings but it would require further model training. Our team is considering experimenting with this, so stay tuned!
Tony Wu
Tony Wu@tonywu_71·
🚀 How can we make sense of visual documents at a large scale for retrieval? With our new paper "ColPali", we propose to leverage VLMs to construct efficient multi-vector embeddings in the visual space for document retrieval: arxiv.org/abs/2407.01449 📝🔍 (1/10)
Jordan
Jordan@ZedDou1·
@tonywu_71 From my understanding, even though OCR is not needed at indexing time, it would still be required at generation time so we can include the text in the prompt, wouldn't it? Or do you directly use the projected patch embeddings to represent the context the LLM should use to answer the query?
Tony Wu
Tony Wu@tonywu_71·
🧩 Current document retrieval systems are complex and require multiple steps like OCR, HTR, layout detection, table parsers, text chunking, etc. This process is difficult to set up, prone to errors, and loses subtle visual cues such as font size and text color. (2/10)
Jordan
Jordan@ZedDou1·
@cwolferesearch Just for the sake of sharing, there's been some research saying the opposite, though (but I didn't go through all the reading/papers) x.com/learnprompting…
Learn Prompting@learnprompting

🚨 Role Prompting doesn't work... Our team at @learnprompting led a year-long study with co-authors from @OpenAI & @Microsoft, analyzing over 1,500 prompting papers. We narrowed it down to 58 different prompting techniques and we analyzed every one. Here's what we found...

Cameron R. Wolfe, Ph.D.
Cameron R. Wolfe, Ph.D.@cwolferesearch·
Quick and easy tip for better evals: Role prompting is a useful trick for generating text with LLMs, but combining role prompting with LLM-as-a-judge can be a problem...

TL;DR: If we tell the LLM judge that it is an "expert" in a certain field within the system message, then the scores outputted by LLM-as-a-judge become much harsher / lower and evaluation quality (i.e., correlation with human evaluation) deteriorates as a result.

What is role prompting? This concept refers to a prompt that asks a language model to assume a role; e.g., a math expert, a poet, a pirate, etc. Upon assuming the role, the LLM will tailor its responses accordingly. For example, an LLM told that it is a pirate might say things like "Ahoy", while an LLM told that it is a math expert may perform arithmetic more accurately.

Why would we use this? Role prompting is a fun trick that can be useful for testing the instruction-following capabilities of an LLM. However, it can also have a non-negligible impact on the quality of the LLM's generations. For example, role prompting an LLM as an expert in a certain field tends to improve the quality of its generations when solving domain-specific problems (e.g., math, reasoning, writing, and more).

What is LLM-as-a-judge? LLM-as-a-judge is a simple but powerful evaluation technique that uses an LLM to score or evaluate the output of another LLM. The simplest way to use LLM-as-a-judge is to write a prompt asking an LLM (e.g., GPT-4) to score a model's response to some prompt on a scale from 1-10 (see image below). We can also pass pairs of responses and ask which one is better (i.e., pairwise scoring instead of pointwise scoring).

Role prompting + LLM-as-a-judge: Role prompting is very useful for generating text with LLMs for the reasons mentioned above. However, we should be very careful when using role prompting in tandem with LLM-as-a-judge. In particular, if we tell the LLM judge that it is an "expert" in a certain field (e.g., math or poetry), it will be much harsher when scoring outputs. In fact, many of the scores will be as low as possible (i.e., 1 out of 10)!
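The caveat can be made concrete with a toy pointwise judge prompt. The exact wording, the role line, and the `judge_prompt` helper are all illustrative assumptions, not any library's API; the only point is that the role sits in the system message, which is the variable the thread warns about:

```python
def judge_prompt(response, use_expert_role=False):
    """Build a pointwise LLM-as-a-judge prompt; optionally role-prompt the
    judge as an expert in its system message."""
    system = ("You are an expert poet and literary critic."
              if use_expert_role else "You are a helpful assistant.")
    return (f"[system] {system}\n"
            f"[user] Score the following response from 1 to 10. "
            f"Reply with the number only.\n\nResponse: {response}")

plain = judge_prompt("roses are red...")
harsh = judge_prompt("roses are red...", use_expert_role=True)
# Per the thread, the 'expert' variant tends to produce much harsher, lower
# scores, so keep role prompting out of the judge's system message.
print(plain)
```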
Jordan
Jordan@ZedDou1·
@_xjdr Does your automated eval rely on an LLM-as-a-judge approach with an LLM that demonstrated a high level of agreement with you, or is it more about evaluating the new LLM on a handcrafted set of "close-ended" tasks?
xjdr
xjdr@_xjdr·
one of the highest return on effort activities i have ever done is setting up automated evals and benchmarking for any new model i touch. I can no longer imagine living without it
Jordan
Jordan@ZedDou1·
@Xianbao_QIAN Thanks for the answer! Does the absorption part mean that, somehow, the up-projection matrix is embedded into the query vector, such that multiplying the query vector by the latent KV vector is equivalent to computing the regular dot product between query and key?
Tiezhen WANG
Tiezhen WANG@Xianbao_QIAN·
@ZedDou1 The projection matrix can be absorbed into another matrix, so the latent KV does not introduce more complexity during inference compared to a normal KV cache. The challenge is how to apply RoPE attention efficiently, since it can't be absorbed. There is a modification of RoPE for the latent version.
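The absorption Jordan asks about can be checked numerically: with a (made-up) up-projection W_uk from latent space to key space, the attention score q·(W_uk c) equals (W_ukᵀ q)·c, so W_uk can be folded into the query side ahead of time and only the small latent vector c needs caching. A minimal pure-Python check of that identity:

```python
def matvec(M, v):
    # Matrix-vector product for plain nested lists.
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def transpose(M):
    return [list(col) for col in zip(*M)]

W_uk = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]  # up-projection: latent -> key dim
q = [0.2, -0.4, 1.0]                           # query vector, key-dimensional
c = [0.7, 0.3]                                 # cached latent KV vector

score_naive = dot(q, matvec(W_uk, c))                # up-project c, then dot
score_absorbed = dot(matvec(transpose(W_uk), q), c)  # W_uk folded into q
print(score_naive, score_absorbed)  # identical scores
```

Since both paths give the same score, the cache never needs the full key vectors, which is exactly why RoPE (applied per token between q and k) breaks the trick.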
Tiezhen WANG
Tiezhen WANG@Xianbao_QIAN·
Attention layer normally uses KV Cache to reduce repetitive compute, but it consumes significant GPU RAM, limiting concurrent requests. DeepSeek V2 introduces Multi-head Latent Attention (MLA), which stores only a small latent representation, resulting in substantial RAM savings.
Jordan
Jordan@ZedDou1·
@BlackHC @arankomatsuzaki From what I've understood, you can either not use them (regular decoding process) or use them for speculative decoding to speed up inference; the additional heads then act as the draft model in the speculative-decoding setup
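That draft-and-verify loop can be sketched with stand-in models. Everything here is a toy assumption: `target_next` plays the main model (it just counts up), `draft_heads` plays the k extra prediction heads, and the third draft token is deliberately wrong to show partial acceptance:

```python
def target_next(prefix):
    # Stand-in for the main model's next-token choice.
    return prefix[-1] + 1

def draft_heads(prefix, k=3):
    # Stand-in for the k extra heads drafting k tokens in one pass.
    return [prefix[-1] + 1, prefix[-1] + 2, prefix[-1] + 99]

def speculative_step(prefix):
    """Accept draft tokens while the main model agrees; on the first
    mismatch, take the main model's token and stop."""
    proposed = draft_heads(prefix)
    out = list(prefix)
    for t in proposed:
        if t == target_next(out):
            out.append(t)                 # accepted draft token: free progress
        else:
            out.append(target_next(out))  # mismatch: keep the real token
            break
    return out

print(speculative_step([0]))  # → [0, 1, 2, 3]
```

One verification pass here yields three tokens instead of one, which is where the inference speedup comes from when draft acceptance rates are high.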
Aran Komatsuzaki
Aran Komatsuzaki@arankomatsuzaki·
Meta presents Better & Faster Large Language Models via Multi-token Prediction - training language models to predict multiple future tokens at once results in higher sample efficiency - up to 3x faster at inference arxiv.org/abs/2404.19737
Jordan
Jordan@ZedDou1·
@yuntiandeng I assume you considered all the non-RTed papers as negatives. This may be problematic, as we may have false negatives in the dataset, unless he's truly a machine that is able to go through all the published papers (is he?) 😁
Yuntian Deng
Yuntian Deng@yuntiandeng·
Will your paper catch the eye of @_akhaliq? I built a demo that predicts if AK will select a paper. It has 50% F1 using DeBERTa finetuned on data from past year. As a test, our upcoming WildChat arXiv has a 56% chance. Hopefully not a false positive🤞 🔗huggingface.co/spaces/yuntian…
Jordan
Jordan@ZedDou1·
@abacaj @major_katsurAGI @nisten @altryne Could it be done to avoid a discrepancy in the meaning of the EOS? E.g., when pre-training with packing, the EOS only marks the separation between two sequences (so the part before the EOS is not necessarily complete), while it is complete when doing instruction fine-tuning
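Jordan's packing scenario can be sketched as follows. `pack` is an illustrative helper, not any tokenizer's API; it just shows why an EOS seen mid-window during packed pretraining carries a different meaning than the EOS that terminates a fine-tuning target:

```python
EOS = "</s>"

def pack(sequences, window=8):
    """Concatenate token lists with EOS between them, then slice the stream
    into fixed-size training windows; a window can start or end mid-sequence."""
    stream = []
    for seq in sequences:
        stream.extend(seq + [EOS])
    return [stream[i:i + window] for i in range(0, len(stream), window)]

windows = pack([["a", "b", "c"], ["d", "e"], ["f", "g", "h", "i"]], window=8)
print(windows[0])
# The EOS tokens inside a packed window only separate unrelated sequences;
# they do not mean "the text before me is a complete answer", unlike the
# single EOS appended to an instruction-tuning target.
```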
anton
anton@abacaj·
@major_katsurAGI @nisten @altryne A different format is usually used during fine-tuning, though in Mistral Instruct they used the same EOS token as in pretraining (</s>). Guess it just depends on what the authors were thinking 🤷‍♂️
anton
anton@abacaj·
See some interesting takes here about phi-3. Some say it sucks (was trained on test set!?), others say it outperforms 70B models. I don’t know what it was trained on, but it’s pretty easy to take the weights and try the model yourself - can even try it without a GPU. I’m finding it pretty useful for extraction tasks