Georeactor Blog

ML Arxiv Haul #29



Tags: arxiv

I thought about bringing back Arxiv posts for biology or plants specifically, but the time was ripe for another ML one. If you'd like a preview of plant oddities, try the journal Plant Signaling & Behavior, founded in the "Society for Plant Neurobiology".

Commentary

Renewed interest (articles and a podcast) in Math Stack Exchange's "Cleo" led one user to pull Q&A metadata. The questions come from accounts that seem to have been created around the same time, and they give no context for why the asker needs a solution to a complex equation, so the asking accounts may be sock puppets that let "Cleo" "find" impressive solutions.

CivitAI banned Stable Diffusion 3 because the license remains confusing (particularly around retraining on generated images). The community continues to prefer Stable Diffusion 2. During continued confusion at Stability AI, some people left to make ComfyUI more official as an org ( https://www.comfy.org/ ).

Since the movie Her is in the discourse: https://www.youtube.com/watch?v=wy_z_KKClBE

The creation of "Zoozve" through information decay https://twitter.com/latifnasser/status/1750952860131729544

Oxford tradition: https://blogs.bodleian.ox.ac.uk/archivesandmanuscripts/2023/12/13/the-persistence-of-tradition-the-curious-case-of-henry-symeonis/

Chat-O.M.G.

Singaporean authors push back on the government's promotion of Sea Lion models: https://restofworld.org/2024/singapore-writers-reject-ai-training/

Unclear how serious this is, but a Reddit comment about Google Search AI misreading Wookieepedia describes it as "Literally spreading misinformation at will". The commenter replies to pushback with "It was important to me".

Fears around ChatGPT generating realistic microscopic images: https://jhoonline.biomedcentral.com/articles/10.1186/s13045-024-01543-8

Viral "ChatGPT is bullshit" paper: https://link.springer.com/epdf/10.1007/s10676-024-09775-5

Stanford researchers find several hallucinations in Westlaw professional AI tool: https://x.com/rajiinio/status/1796562675394339123

Papers

A Fine-tuning Dataset and Benchmark for Large Language Models for Protein Understanding

The parallels between protein sequences and natural language in their sequential structures have inspired the…

Interesting to share a brain-wave with these researchers on creating a dataset with protein sequences and questions. On a closer look, they have seven training questions and one eval split in the dataset: https://huggingface.co/datasets/tsynbio/ProteinLMBench

So for one question, all of the texts are:

Analyze the sequence of the following enzyme protein, and step by step deduce which reaction(s) this enzyme is involved in?

For a fluent bio+text model, I'm not super-convinced by their decision to have the same prompt in each row, and their definition of "step by step". It's also interesting that no models have been shown with evaluations on this dataset.


An Empirical Study of Mamba-based Language Models

Selective state-space models (SSMs) like Mamba overcome some of the shortcomings of Transformers, such as quadratic…

ML Reddit is always asking "what happened" to Mamba and if it really works, and here's an actual study using Mamba, Mamba-2, and Mamba-2-Hybrid (50% MLP layers) at the scale of 8B parameters.

They find that the hybrid model is the best, but transformers outperform the basic Mamba models.


An image speaks a thousand words, but can everyone listen? On image transcreation for cultural relevance

Given the rise of multimedia content, human translators increasingly focus on culturally adapting not only words but…

Examples of image-generation and image-to-image tasks for models to adapt to different regional preferences. For example, a plate of meat and greens would have a strong go-to look in marketing / media in different countries. The limitations section is also interesting: can there be a 1:1 replacement, should some cultural context be replaced, and how do we feel about associating images with specific countries?
Considering that Figure 2 is a beef example and India is a country/culture in the paper, I wonder why I don't see the word "vegetarian" in the paper or an exercise on converting it for India (beef availability is variable in different parts of India).

I know the first author Simran from the MuRIL paper on Indian languages, and she's now at CMU and doing a keynote on this.

Along similar lines, BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages. Questions might ask "what do people eat on their birthday in South Korea?" with Q&A tasks in Korean and English.

And DIG In: Evaluating Disparities in Image Generations with Indicators for Geographic Diversity is a Meta paper looking at object detection and generation with region prompts.


Can It Edit? Evaluating the Ability of Large Language Models to Follow Code Editing Instructions

A significant amount of research is focused on developing and evaluating large language models for a variety of code…

Paper from researchers mostly at Northeastern, evaluating code-editing. They have 105 tasks, divided into train / test / eval splits, with "descriptive" and "lazy" versions of explaining the tasks to the model.

GPT, CodeLlama, and StarCoder2 perform well, and the team's finetuned model EditCoder outperforms all except GPT-4 (they're roughly level with GPT-3.5 Turbo on the "lazy" version).


Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?

Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on…

Google/DeepMind work on the confluence of changes in LLMs and agents.
They introduce a LOFT benchmark to replace the old school needle-in-haystack test for million-token context LLMs. Some examples include loading multiple documents or a whole dataset instead of handing off control to RAG / a SQL query.
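A rough sketch of what "loading multiple documents" into the context instead of calling a retriever could look like. The function name, prompt wording, and ID + text layout are hypothetical placeholders for illustration, not the format used in the LOFT paper.

```python
def build_corpus_prompt(documents, query):
    """Pack a whole (small) corpus plus a question into one prompt string.

    documents: list of strings (toy stand-in for a retrieval corpus).
    query: the user's question.
    The layout below is an assumption made for illustration only.
    """
    lines = ["You will be given a corpus of documents, then a question."]
    for doc_id, text in enumerate(documents):
        # Each document gets an ID so the model can point back to it
        lines.append(f"ID: {doc_id} | TEXT: {text}")
    lines.append(f"Question: {query}")
    lines.append("Answer with the ID of the most relevant document.")
    return "\n".join(lines)


docs = [
    "Mamba is a selective state-space model.",
    "LSTMs were introduced in the 1990s.",
]
prompt = build_corpus_prompt(docs, "Which document mentions state-space models?")
```

With a million-token window, the same idea would scale to thousands of documents, which is where the comparison against RAG and SQL pipelines becomes interesting.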

At 128k tokens their model is comparable to their standard Gemini model. When extended to a million tokens, the performance is still good, but falls below specialized tools.

At one point they claim that the LOFT benchmark can be redesigned for billion-token context models. Wild.


ChangeMamba: Remote Sensing Change Detection with Spatio-Temporal State Space Model

Convolutional neural networks (CNN) and Transformers have made impressive progress in the field of remote sensing…

Could Mamba be used for GIS data? This outperforms transformer models at building damage detection, land-use change maps, etc.

There's an interestingly similar RS-Mamba for Large Remote Sensing Image Dense Prediction. They also show segmentation / masks of aerial images, and it looks like they are saying the benefit of Mamba is that it can process a larger image.


Large Language Models can Strategically Deceive their Users when Put Under Pressure

We demonstrate a situation in which Large Language Models, trained to be helpful, harmless, and honest, can display…

Entrapment! By Apollo Research.
A GPT-4 agent is shown insider information, and it issues a trade order.
When asked, GPT-4 answers that it traded on public information. However! The LLM setup has a scratch area for thinking step-by-step, which reveals that GPT-4 is intentionally withholding the truth of its decision.

I have some questions about exercises where the LLM scratch area reveals that it is being deceptive. I'm remembering Anthropic's Sleeper Agents paper from January, which also builds on a TaskRabbit example from ARC / GPT-4 paper. Even "sleepier" LLMs might be a complex disjointed system which doesn't know why it made a decision, or be smart enough to understand that the scratch area is human-readable?


SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding

As large language models (LLMs) become increasingly integrated into real-world applications such as code generation and…

When looking at the probability of tokens in the LLM response, the model might have a high probability of answering the question, and a smaller probability of refusing to answer for safety reasons. In this paper, they find the probabilities of starting with a rejection ("Sorry...", "I cannot...") in the original model and an initial safety-finetuned model, and use the delta to pump up the likelihood of a rejection on questions.
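A minimal sketch of the "use the delta" idea, assuming we already have next-token probabilities from both models as plain dicts. The function name, the `alpha` knob, and the toy numbers are hypothetical; the paper's actual procedure (how many tokens it considers and for how many decoding steps) may differ.

```python
def boost_refusals(base_probs, safe_probs, alpha=2.0):
    """Shift next-token probabilities toward the safety-finetuned model.

    base_probs / safe_probs: dicts of token -> probability from the
    original model and the safety-finetuned model (toy inputs).
    alpha: how strongly to amplify the difference (assumed parameter).
    """
    # Only adjust tokens that both models assign a probability to
    shared = set(base_probs) & set(safe_probs)
    adjusted = {}
    for tok in shared:
        delta = safe_probs[tok] - base_probs[tok]
        # Tokens the safe model likes more go up, others go down (floored at 0)
        adjusted[tok] = max(base_probs[tok] + alpha * delta, 0.0)
    total = sum(adjusted.values())
    if total == 0:
        raise ValueError("no probability mass left after adjustment")
    # Renormalize so the result is a distribution again
    return {tok: p / total for tok, p in adjusted.items()}


base = {"Sure": 0.7, "Sorry": 0.2, "I": 0.1}
safe = {"Sure": 0.4, "Sorry": 0.45, "I": 0.15}
new_probs = boost_refusals(base, safe, alpha=2.0)
```

In this toy case "Sorry" overtakes "Sure" as the most likely first token, even though the original model strongly preferred to comply.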


SafetyPrompts: a Systematic Review of Open Datasets for Evaluating and Improving Large Language Model Safety

The last two years have seen a rapid growth in concerns around the safety of large language models (LLMs). Researchers…

Team reviews an increasingly synthetic collection of datasets at SafetyPrompts.com and the general research ecosystem.

They say that recent datasets often are "specialised" meaning they follow a specific mode (rule-following); I don't see a mention of biological data, but there is interest in cyberattack / cybersecurity.


Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling

Efficiently modeling sequences with infinite context length has been a long-standing problem. Past works suffer from…

Mamba riff from Microsoft, 3.8B params. The 1.7B model outperforms baseline Mamba and SWA/MLP versions.


SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature

We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K…

Allen AI, which has been doing a lot with Semantic Scholar, releases instruction and Q&A datasets for working on scientific papers. This is relevant to my biorxiv type interests.

GPT-4 did great; the team also releases SciTulu models.

Also: Google released their own SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers.


SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages

Southeast Asia (SEA) is a region rich in linguistic diversity and cultural variety, with over 1,300 indigenous…

I've mentioned this group's crowdsourcing on Indonesian / NUSA languages, their Slack, and their movement to Discord. Here's their mega paper with examples in 980 of 1,300 target languages. Most tasks have a small subset of languages, but these still go above and beyond the typical Common Crawl-type corpora.

In limitations, they say that dialects of major languages did not get collected (their example: Sarawak Malay). Maybe it'd be interesting to figure out how to collect that.


TextGrad: Automatic "Differentiation" via Text

AI is undergoing a paradigm shift, with breakthroughs achieved by systems orchestrating multiple large language models…

Meta research. They seem to be particularly interested in code-generation tasks.

It's difficult for me to conceptualize how this fits into things. It looks like a technique to help the model read what it's generated and provide feedback to iterate on it. Even the prompt can be changed. So they compare it to dspy which I also need to learn.


Time is Encoded in the Weights of Finetuned Language Models

We present time vectors, a simple tool to customize language models to new time periods. Time vectors are created by…

Difficult to decode, but they find tasks (political classification of Tweets, news summarization) and train the model with some kind of specially aligned vectors, such that the model can be better at classifying Tweets from a specific time.
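One way to make this more concrete, assuming time vectors work like task vectors: finetune on one time period, subtract the pretrained weights, and treat the difference as something you can add back or interpolate. That reading is an assumption (the abstract above is cut off), and the toy "weights" and function names below are hypothetical.

```python
def time_vector(finetuned, pretrained):
    """Difference between finetuned and pretrained weights, per parameter."""
    return {
        name: [f - p for f, p in zip(finetuned[name], pretrained[name])]
        for name in pretrained
    }


def interpolate(vec_a, vec_b, t):
    """Blend two time vectors; t=0 gives vec_a, t=1 gives vec_b."""
    return {
        name: [(1 - t) * a + t * b for a, b in zip(vec_a[name], vec_b[name])]
        for name in vec_a
    }


def apply_vector(pretrained, vector):
    """Add a time vector back onto the pretrained weights."""
    return {
        name: [p + v for p, v in zip(pretrained[name], vector[name])]
        for name in pretrained
    }


# Toy two-parameter "models" (made-up numbers, not real checkpoints)
pretrained = {"w": [1.0, 2.0]}
tuned_2015 = {"w": [2.0, 2.0]}
tuned_2017 = {"w": [1.0, 4.0]}

vec_2015 = time_vector(tuned_2015, pretrained)
vec_2017 = time_vector(tuned_2017, pretrained)
# A guess at a "2016" model: halfway between the two years
model_2016 = apply_vector(pretrained, interpolate(vec_2015, vec_2017, 0.5))
```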


Training Compute-Optimal Protein Language Models

We explore optimally training protein language models, an area of significant interest in biological research where…

Someone on Twitter called this the Chinchilla for sizing bio LLMs. Evaluation includes a familiar task (fluorescence from TAPE), and two major models PROGEN2 and ESM-2.

There are some insights about causal (GPT) vs. masked (BERT) architecture models on proteins which I am not good at interpreting, also I am still thinking about embeddings or mixed-modal for proteins so I don't know how to apply it all.


Transformers Can Do Arithmetic with the Right Embeddings

The poor performance of transformers on arithmetic tasks seems to stem in large part from their inability to keep track…

Tapping into embeddings to make sure the model knows when digits are in the tens place, hundreds place, etc. Impressive performance on math problems which have long been an issue for transformer models.
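A small sketch of the kind of signal this gives the model: an extra index per token saying which place a digit occupies within its number. This assumes one character per token and counts places from the right; the paper's exact indexing scheme may differ, and the function name is hypothetical.

```python
def digit_place_ids(tokens):
    """Return a place index for each token: 1 = ones, 2 = tens, 3 = hundreds...

    Non-digit tokens get 0. Assumes each token is a single character.
    """
    ids = [0] * len(tokens)
    place = 0
    # Walk right to left so the ones digit of each number is seen first
    for i in range(len(tokens) - 1, -1, -1):
        if tokens[i].isdigit():
            place += 1
            ids[i] = place
        else:
            # Any non-digit ends the current number
            place = 0
    return ids


place_ids = digit_place_ids(list("123+45="))
```

These indices could then be looked up in a learned embedding table and added to the usual token embeddings, so that the "3" in 123 and the "5" in 45 share a "ones place" signal.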


Unnatural Error Correction: GPT-4 Can Almost Perfectly Handle Unnatural Scrambled Text

While Large Language Models (LLMs) have achieved remarkable performance in many tasks, much about their inner workings…

GPT-4 corrects sentences and answers questions from scrambled text, and does considerably better than GPT-3.5 and other models. Impressive, considering that the tasks which defeat GPTs typically tap into tokenization issues.
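To get a feel for the input, here is a minimal sketch that scrambles the letters inside each word of a sentence. It assumes whitespace-separated words and full shuffling of every word; the paper's scrambling settings may be different, and the function names are hypothetical.

```python
import random


def scramble_word(word, rng):
    """Shuffle all letters of one word using the given random generator."""
    letters = list(word)
    rng.shuffle(letters)
    return "".join(letters)


def scramble_sentence(sentence, seed=0):
    """Scramble each whitespace-separated word; word order is kept."""
    rng = random.Random(seed)  # seeded so the output is repeatable
    return " ".join(scramble_word(w, rng) for w in sentence.split())


original = "transformers struggle with character level tasks"
scrambled = scramble_sentence(original, seed=42)
```

Each scrambled word keeps the same letters as the original, so the information is still there in principle, but the tokenizer will likely split it into very different pieces.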


WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild

We introduce WildBench, an automated evaluation framework designed to benchmark large language models (LLMs) using…

Allen AI project selecting 1,024 tasks from real chatbot conversations. These were pulled out of lots of chatbot logs and labeled (information seeking, code generation), but from skimming I don't see written examples of tasks. They share the prompts that they used on GPT-4-turbo to evaluate model responses. GPT, Gemini, Llama, and Claude all score well.
They show that their ranking is comparable to the human-voted Chatbot Arena ranking.


xLSTM: Extended Long Short-Term Memory

In the 1990s, the constant error carousel and gating were introduced as the central ideas of the Long Short-Term Memory…

Team in Austria pulls an RWKV, this time reviving the LSTM architecture for the billion-param LLM era. LSTMs scale linearly with context length, so this could be helpful for long-context problems. By arranging LSTMs into blocks, they have a neural network which has better perplexity scores than transformers or Mamba models in the same 1B param category.

I'd like to try this for the bio LLM stuff.