Twitter shut off API access; users are now volunteering their data for an open API

A lot of community applications died when API access to twitter, reddit & others was severely restricted, but they can't stop us from sharing our own data with each other!

Sep 18, 2024

There are ~1 million tweets in a publicly accessible dataset, volunteered by users. It’s hosted on community-archive.org, here’s the GitHub repo. Anyone can build useful tools on top of this dataset, commercial or otherwise.

I’ve been yearning for this for a long time, and it only just now occurred to me: you don’t have to trick or coerce users into giving up their data, you can just ask them nicely?? They will share it as long as doing so benefits them.

Everyone who has contributed so far is benefiting because we want to build on top of this data (and we can monetize those tools, we’re upfront about this with each other), OR answer questions about ourselves & our communities, OR build the UIs we want for ourselves that are not currently possible.

The next wave of people who contribute will do so because they get genuine benefit from that ecosystem of open tools. If they only want to analyze their own tweets, they can do so privately. If they want to get insights about cultural trends in their community, or map out their network & their friends’ friends network, they can make it happen.

EDIT: as an update to this story, I found out the Washington Post is doing something very similar, asking its users to export & share their user data. The data is private, but the analysis is public.

Request for feedback

Has anything like this been done before? What went well & what went wrong? There's a Discord where the dev’s hang out, and a lot of the discussion is also happening publicly in GitHub issues.

For example, here’s a proposal for how to make the data publicly available as cheaply as possible (put everything in S3, behind a CDN). Is it OK that there’s no auth required at all to fetch this data?

My personal vision

(1) run the archive pipeline as minimally as possible + first class support for self hosting

The archive itself should be targeted to developers, with a focus on the API, where the data is, and how to contribute. I think should be very similar to OpenAddresses.io.

homepage of Open Addresses for inspiration

Self hosting is important because this only works if people trust that no one entity is in control of the data. Also, user data is very sensitive, communities may want to develop collections that are invite-only. This is still a net win for the movement.

These private communities:

could publish “artifacts” from the raw data, like an anonymized subset, or summary statistics
could become open later on, or merge with other private communities
can still share improvements in the open tools & pipeline even if the data is closed
you can make a commercial product that works on this data that communities can pay to use

Tweetback is a tool for self hosting your tweet archive & making it searchable. I consider this part of the same ecosystem, it would be cool to have a collection of links to people who self host.

(2) make it super easy to access the raw data, maybe a hackathon?

A user who doesn’t know how to code can take a URL that leads to their (or anyone else’s) tweets.json, and ask their LLM, “generate vanilla JS/HTML code to visualize this as a grid” or “a timeline” and poof, they have their own personal UI. They can now publish this, or share it so others can build on it.

People can build games with this, like:

“Banger or not” - given a tweet, can you guess whether it has < 100 views, or over a million? (they say that it’s random what goes viral, but people who are very good at it beg to differ)
Multiple choice quiz: given a graph of emotional sentiment across time with various spikes, can you guess what political or community event happened that day?

These games are fun but they’re ALSO insightful. It’s a fun way to learn the history of whatever community’s tweets you have. For example, if you convince enough computer graphics people to share their archives, or fintech people, what interesting industry trends could you watch evolve, or trace how they originated?

(3) offline first tools can help it spread

It would be cool if some of these visualization or analysis tools can work WITHOUT having to upload your data to a public archive. Like, drag & drop the zip file twitter gives you and do the stats in the browser. OR a tool that lets you upload just the vector embeddings, not the tweets themselves (that way your data is private but you can still find people whose writing most closely matches yours, and you can go talk to them, or find clusters in our communities etc).

This would reinforce that the project’s goals isn’t inherently to collect data as much as it is to create value & ownership of our data. People getting used to analyzing & building with their own data is a net win for everyone. It makes you see the possibilities and gives you a higher bar for what you demand of your commercial products.

You also don’t want people to share their data just because it benefits others, you want to show them what value *they* can get, and if they’re convinced they become your biggest evangelists.

(4) get specific people on board, community survey to see whose archives are most requested?

Patrick McKenzie (patio11) has 60k posts on twitter. He posts a lot of incredible “inside scoop” stories about tech & finance that regularly make the front page of HackerNews. What if I could ask patio’s archive: “what are some good books to read about [topic]” or “what advice would you give to someone trying to get a job at Stripe”

Patrick has answered these exact questions many times. I don’t want an LLM to make up an answer, I want to do fuzzy/semantic search & *find* the answer.

Imagine if I could ask: “have patio’s thoughts on bitcoin changed over the years? show me his most positive and negative tweets about bitcoin, graphed over time”

We don’t need to convince twitter to make this happen. We just need to convince the people who create value in our communities to export their data. We can make it happen ourselves.

It would be a win-win: patio wouldn’t have to keep answering the same questions over & over, and a lot more people would benefit from the wealth of knowledge he’s already put out publicly (especially now that you can’t even read twitter threads without being logged in).

(5) apply for funding? from where? Grants? Mozilla?

There’s going to be a bunch of hosting costs as this thing grows, it would be cool to also compensate the developers working on this. I imagine it like a similar structure to Open Street Map perhaps. The data is fully open, and CAN be commercialized, but the developers don’t need to monetize their work directly to make it sustainable (it’s sustained because a lot of commercial interests rely on it?)

I like the story of Open Street Map because you basically had one platform (Google) that invested billions in map data, but wouldn’t share it. Other companies wanted to but failed to compete with this (Apple, Microsoft, Facebook). The way they succeeded was in collectively funding the open data that took away this monopoly. The result of this fight is that the big companies benefited, but so did the general public. We now have a global, freely available dataset of every building in the world that anyone can use for anything from non profit/scientific work to commercial applications.

It’s win win. Even Google benefited from this because now the winning product is the best product, not the one that holds the data hostage. Competition forces innovation, and innovation is how companies (& society) thrive.

I can imagine companies like Mastodon & Blue Sky would be interested in these efforts. It’s getting the average user excited about why open data matters, which is the whole promise of these platforms. Maybe people can build a simulation of alternative feeds, like what WOULD twitter look like if you had control over the levers of your algorithm.

Like imagine if you could sort your timeline or any discussion thread by tweets that show “curiosity”, “good faith argument”, or split any discussion thread into “pro” and “against” posts, and cluster by similarity.

Andy Matuschak was going to build this but he gave up, because of the lack of an open API:

https://x.com/andy_matuschak/status/1780745271644914172

If there are communities that really hate it on twitter but are stuck due to network effects, they can use this archival pipeline to move everything, along with the history, follower network, likes etc. Long term this is good for everyone, including twitter (competition & innovation).

Thanks for reading!! I’d love to hear any thoughts you have any this (especially if you’re skeptical). The more I think about this, the more I want it for all communities I am a part for.

For example, my local r/ithaca subreddit has people asking the same questions over and over (where’s the best place in town for XYZ). There’s some efforts for a community FAQ but this always goes out of date. I think when we start to do this, making regular snapshots will become easier, or companies will eventually relent and open up APIs again, and then our work here will be done.

Omar’s Writing

Discussion about this post