


ScreenAI: A Vision-Language Model for UI and Infographics Understanding

Screen user interfaces (UIs) and infographics, sharing similar visual language and design principles, play important roles in human communication and human-machine interaction. We introduce ScreenAI, a vision-language model that specializes in UI and infographics understanding. Our model improves upon the PaLI architecture with the flexible patching strategy of pix2struct and is trained on a unique mixture of datasets. At the heart of this mixture is a novel screen annotation task in which the model has to identify the type and location of UI elements. We use these text annotations to describe screens to Large Language Models and automatically generate question-answering (QA), UI navigation, and summarization training datasets at scale. We run ablation studies to demonstrate the impact of these design choices. At only 5B parameters, ScreenAI achieves new state-of-the-art results on UI- and infographics-based tasks (Multi-page DocVQA, WebSRC, MoTIF and Widget Captioning), and new best-in-class performance on others (Chart QA, DocVQA, and InfographicVQA) compared to models of similar size. Finally, we release three new datasets: one focused on the screen annotation task and two others focused on question answering.
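
As a rough illustration of what a pix2struct-style flexible patching strategy involves, the sketch below picks a patch grid that approximately preserves the aspect ratio of a screenshot while staying within a fixed patch budget, instead of resizing every image to a fixed square. The function name, the patch size of 16 and the budget of 1024 patches are assumptions chosen for illustration; the abstract does not state ScreenAI's actual settings.

```python
import math


def flexible_patch_grid(image_height, image_width, patch_size=16, max_patches=1024):
    """Return (rows, cols) of a patch grid that roughly keeps the aspect ratio.

    Hypothetical helper: patch_size and max_patches are illustrative values,
    not settings reported for ScreenAI.
    """
    # Scale factor so that rows * cols lands close to (but not above) max_patches.
    scale = math.sqrt(
        max_patches * (patch_size / image_height) * (patch_size / image_width)
    )
    # Clamp each dimension to at least 1 patch and at most the full budget.
    rows = max(min(math.floor(scale * image_height / patch_size), max_patches), 1)
    cols = max(min(math.floor(scale * image_width / patch_size), max_patches), 1)
    return rows, cols


if __name__ == "__main__":
    # A tall phone screenshot gets more rows than columns.
    print(flexible_patch_grid(2400, 1080))
    # A wide desktop screenshot gets more columns than rows.
    print(flexible_patch_grid(1080, 1920))
```

The image would then be resized to `rows * patch_size` by `cols * patch_size` pixels before being cut into patches, so tall mobile screens and wide desktop screens each keep a shape close to their original one.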

arxiv.org