The true cost and code of parsing the whole of (French-speaking) Bluesky ATProto in Python

I was reading this news on ycombinator and was flabbergasted by the claims people make about the cost and complexity of parsing the whole of Bluesky.
>I suspect that the cost of running AT proto servers/relays is prohibitive for smaller players compared to a Mastadon server selectively syndicating with a few peers, but I say this with only a vague understanding of the internals of both of these ecosystems.
NB: the news item is about contradicting this assertion, in terms I don't understand.

Well, suspecting is nice, but what about reality?

I actually run a full real-time scan of Bluesky with some Python code, from my family's outdated PC. And then, later on, I read:
- AppViews are actual "application backends". Bluesky operates the bsky.app appview, i.e. what people know as the Bluesky app. Importantly, in ATProto, there is no reason for everyone to run their own AppView. You can run one (and it costs about $300/mo to run a Bluesky AppView ingesting all data currently on the network in real time if you want to do that).


OK, I run my bot with no extra cost other than electricity and my everyday FTTH connection (common in social housing).

It actually takes:
  • 25% of CPU
  • less than a third of the domestic bandwidth
  • on a dual-core Core i3 with multithreading disabled
  • on a standard mint distribution
  • using 640 MB of memory

I may not be smart, but the bot actually runs. So, without being a specialist, I can assure you that even if you are not the sharpest knife in the drawer, you can also build your own atproto bot at home, in Python, without investing $300/mo.

So here is my feedback on making a bot in Python that can scan the whole of Bluesky (sort of): how much volume it represents, plus actual code and hints for doing it.



Volume

For my bot @trollometre.bsky.social to find the most reposted French-language skeets on Bluesky, it must inspect them all under a rate limit of 10 requests per second.
But scanning the firehose is free. The rate limit applies to requests such as get_post, which I use intensively.
If you look at the per-event volume of Bluesky available here, you will notice that post events are fairly few compared to the total (like events being the most common).

With a free rate limit of 10 requests per second against 50 posts per second, scanning the whole of Bluesky looks tough. (Spoiler alert: if you want a fair sampling, I will hint at how I experimentally achieved it.)

Post events coming from the firehose are complete, and they have a langs field. Fortunately, for scanning every post in French under the rate limit, I have far fewer events per second: 2-3% of the mass, less than 1 post per second. Hence, my bot relies heavily on post events.
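To make this concrete, here is a minimal sketch of the two ideas above: how much of a stream a given rate limit can cover, and how to keep only French posts using the langs field. It assumes each post record has already been decoded into a plain Python dict. The names coverage and is_french and the sample records are made up for illustration; they are not taken from trollometre.py.

```python
def coverage(rate_limit: float, posts_per_second: float) -> float:
    """Fraction of posts that can be inspected with one request per post."""
    return min(1.0, rate_limit / posts_per_second)


def is_french(record: dict) -> bool:
    """Return True if the post record declares French among its languages."""
    # "langs" may be absent on a record; treat that as "language unknown".
    langs = record.get("langs") or []
    # Tags may carry a region suffix (e.g. "fr-CA"), so compare the primary subtag.
    return any(tag.lower().split("-")[0] == "fr" for tag in langs)


# Hypothetical decoded records, only here to exercise the filter.
records = [
    {"text": "Bonjour à tous", "langs": ["fr"]},
    {"text": "Hello world", "langs": ["en"]},
    {"text": "Salut / hi", "langs": ["en", "fr-CA"]},
    {"text": "no language declared"},
]
french = [r for r in records if is_french(r)]

print(coverage(10, 50))  # whole firehose: only a fraction of posts can be inspected
print(coverage(10, 1))   # French posts only: every post can be inspected
print(len(french))
```

With the figures quoted above, the whole firehose would only allow inspecting one post in five, while the French subset fits comfortably under the limit.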

Spam and blocked? What are these?

Around 10% of the traffic from accounts actively posting every day consists of people showing the pieces of flesh that are symptomatic of their being mammals; if not blocked, it represents two thirds of the most posted traffic.

For you it may be OK. For me, who is not very fond of these, it diminishes the signal-to-noise ratio of what 90% of users wish to convey. Thus, I decided to block them.

Also, active users representing ~25% of the traffic in the remainder of the French-speaking community (664) may not have been favorably impressed by the initial wording of my bot and its name.

Well, it cannot be helped. I have always had terrible taste in naming my projects.

By using Bluesky's tagging of content as porn and building a blacklist out of it, I consolidated a filter that is 95% efficient. As my detector tends to show, I am converging towards a full list.
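As a sketch of the blacklist idea (not the bot's actual code): assume each inspected post has been reduced to a dict holding the author's identifier and the label values attached to the post. The field names author_did and labels, the helper names, and the exact set of label values are all assumptions made for illustration.

```python
# Assumed label values marking adult content; adjust to what you actually observe.
ADULT_LABELS = {"porn", "sexual", "nudity"}


def update_blacklist(blacklist: set, post: dict) -> bool:
    """Add the post's author to the blacklist if the post carries an adult label.

    Returns True when the author was flagged by this post.
    """
    if ADULT_LABELS & set(post.get("labels", [])):
        blacklist.add(post["author_did"])
        return True
    return False


def is_blocked(blacklist: set, post: dict) -> bool:
    """True if the post's author was flagged earlier, even if this post is unlabeled."""
    return post["author_did"] in blacklist


# Hypothetical posts: one labeled, then an unlabeled one by the same author.
blacklist = set()
update_blacklist(blacklist, {"author_did": "did:example:alice", "labels": ["porn"]})
later_post = {"author_did": "did:example:alice", "labels": []}
print(is_blocked(blacklist, later_post))
```

The point of keeping a per-author list is that a single labeled post is enough to filter the author's later, unlabeled posts.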


Let's talk about coding



Resources



First, you don't dive into code without resources. Normally I should direct you towards the official one, but... I can hardly read it.

I did most of the bot by writing frankencode: copy-pasting from the Python API example.

Then I discovered that having access to the API is quite nice, and I think the source code of the client is clear.



Randomly scanning the whole of Bluesky without burning your rate limit



I made a mistake at the beginning by starting from a non-multi-worker example of the firehose, and stumbled on serendipity: when the event loop and the worker run without multiprocessing, the worker is starved. A starved worker only handles a fraction of the events, and this seems (it must be reproduced to be sure) to keep it under the rate limit.

Websockets?
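What happened here by accident can also be done on purpose. The sketch below is not what the bot does: it swaps in a classic token bucket that simply drops events once the request budget is spent, so the rate-limited calls never exceed the limit. The class name TokenBucket and its parameters are illustrative, and the clock is injectable so the behaviour can be checked without waiting.

```python
import time


class TokenBucket:
    """Allow at most `rate` operations per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.last = clock()

    def try_acquire(self) -> bool:
        """Return True if one operation may run now; never blocks."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over budget: the caller skips this event
```

In a firehose handler, one would call try_acquire() before each rate-limited request and skip the event when it returns False. Which events get dropped depends on arrival order, so this is only an approximation of a fair random sample.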



I tried the websocket firehose but either I did something wrong or it is unreliable.

The code (insufficiently documented and non PEP8 compliant)

The documentation is especially thin on how to run the moving parts.

The database structure is pretty thin: it has one table.

The main part of the code is here: trollometre.py

It is a classical (according to the example) multiprocessing architecture.
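For readers who have not seen that pattern, here is a minimal sketch of a queue-plus-workers layout using the standard multiprocessing module. It is a generic illustration under stated assumptions, not the layout of trollometre.py: handle_event is a stand-in for the real per-event work, and run and worker are hypothetical names.

```python
import multiprocessing as mp

SENTINEL = None  # put on the queue once per worker to tell it to stop


def handle_event(event: dict) -> str:
    """Stand-in for the real per-event work (decoding, filtering, scoring)."""
    return event["text"].upper()


def worker(inbox, outbox) -> None:
    """Consume events from inbox until the sentinel arrives."""
    while True:
        event = inbox.get()
        if event is SENTINEL:
            break
        outbox.put(handle_event(event))


def run(events: list, n_workers: int = 2) -> list:
    """Fan a list of events out to worker processes and collect the results."""
    inbox, outbox = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(inbox, outbox)) for _ in range(n_workers)]
    for proc in procs:
        proc.start()
    for event in events:
        inbox.put(event)
    for _ in procs:
        inbox.put(SENTINEL)
    # Drain results before joining so no worker blocks on a full pipe.
    results = [outbox.get() for _ in events]
    for proc in procs:
        proc.join()
    return results
```

To try it, call run([{"text": "salut"}, {"text": "coucou"}]) under an `if __name__ == "__main__":` guard; results may come back in any order. The receiving loop stays free to read the stream while the workers do the slow part.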
Let's tackle some features:

A websocket interface to administer the HAM/SPAM classification



The main code embeds a websocket server, and there is an HTML page acting as a client.

For each published post, this web page offers actions: HAM (to tag it as normal content), SPAM (to tag it as spam), and POST.
For this to work, the Flask backend needs to be running.
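As a sketch of what the server side of such an admin channel could do with an incoming message, assume the page sends a small JSON object like {"action": "SPAM", "uri": "..."}. That message format, the function name apply_admin_message, and the in-memory dict are all hypothetical; the real protocol lives in the repository. Only HAM and SPAM are handled here.

```python
import json

VALID_ACTIONS = {"HAM", "SPAM"}


def apply_admin_message(raw: str, classification: dict) -> bool:
    """Record a HAM/SPAM decision from a JSON message; return False if it is rejected."""
    try:
        message = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(message, dict):
        return False
    action, uri = message.get("action"), message.get("uri")
    # Reject anything that is not an explicit, well-formed decision.
    if action not in VALID_ACTIONS or not isinstance(uri, str):
        return False
    classification[uri] = action
    return True


decisions = {}
# Hypothetical message, as the admin page might send it.
apply_admin_message('{"action": "SPAM", "uri": "at://did:example:bob/post/1"}', decisions)
print(decisions)
```

Validating the message before touching any state matters because anyone who can reach the websocket can send arbitrary text.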

The learn stuff



Natural language processing is fun, so I added spam detection as a last line of defense. It is built with learn.py.
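learn.py is not reproduced here. As a hedged illustration of what a small text classifier for this job can look like, here is a minimal naive Bayes sketch in pure Python. It is not necessarily the technique learn.py uses, and the class name, the tokenizer and the tiny training set are invented for the example.

```python
import math
import re
from collections import Counter

LABELS = ("ham", "spam")


def tokenize(text: str) -> list:
    """Lowercase the text and split it into word tokens (accents included)."""
    return re.findall(r"\w+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {label: Counter() for label in LABELS}
        self.doc_counts = Counter()

    def train(self, text: str, label: str) -> None:
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def score(self, text: str, label: str) -> float:
        """Log-probability (up to a constant) that the text belongs to label."""
        vocab = set().union(*self.word_counts.values())
        total_words = sum(self.word_counts[label].values())
        total_docs = sum(self.doc_counts.values())
        logp = math.log((self.doc_counts[label] + 1) / (total_docs + len(LABELS)))
        for word in tokenize(text):
            count = self.word_counts[label][word]
            logp += math.log((count + 1) / (total_words + len(vocab) + 1))
        return logp

    def classify(self, text: str) -> str:
        return max(LABELS, key=lambda label: self.score(text, label))


# Invented training examples, far too few for real use.
model = NaiveBayes()
model.train("bonjour le monde politique", "ham")
model.train("le match de foot ce soir", "ham")
model.train("photos chaudes lien en bio", "spam")
model.train("abonne toi lien en bio", "spam")
print(model.classify("lien en bio"))
```

A real filter would need many hand-tagged posts, which is exactly what a HAM/SPAM tagging interface produces.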

Score setting



I decided to aim at a fairly constant number of posts per day. The heart of the score setter is an independent process that is here.
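The actual score setter is not reproduced here. A minimal sketch of the idea, assuming the "score" is a minimum repost count that a skeet must reach before the bot relays it: raise the threshold when too many posts went out, lower it when too few did. The function name, the step of 1 and the tolerance band are assumptions chosen for illustration.

```python
def adjust_threshold(threshold: int, posted_today: int, target: int, tolerance: int = 2) -> int:
    """Nudge the repost threshold so that the daily output drifts towards target."""
    if posted_today > target + tolerance:
        return threshold + 1          # too chatty: be more selective
    if posted_today < target - tolerance:
        return max(1, threshold - 1)  # too quiet: be less selective, never below 1
    return threshold                  # close enough: leave it alone


# Hypothetical week of daily counts, with a target of 20 posts per day.
threshold = 10
for posted in [35, 30, 24, 21, 12, 19, 20]:
    threshold = adjust_threshold(threshold, posted, target=20)
print(threshold)
```

Moving one step at a time is slow to react but avoids oscillating between flooding and silence.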

Plotting



Once the RRD archive is created, nothing beats a Perl one-liner in a Perl one-liner to transform the data in the CSV into a proper RRD graph.

Conclusion



I always feel like an impostor because I cannot speak as loudly and vehemently, with technical words, as people on ycombinator or lobsters do.

I write « toy code » that I can show, with graphs of it actually RUNNING in my dining room.

And from the experience of my toy code, under a free software licence, which has actually been working for 2 months on a random PC that is not fancy, I think YOU can probably give the atproto/Bluesky API a try.

I would pretty much advise you to throw away my code and start from MarshalX's example.
