# RSS In The Wild
#HTTP
#RSS
#small web
I recently came across [Kagi Small Web](https://kagi.com/smallweb) after I started using [Kagi](https://kagi.com/) as my primary
search engine. From their launch [blog post](https://blog.kagi.com/small-web):
> “small web” typically refers to the non-commercial part of the web, crafted by individuals to express themselves or
> share knowledge without seeking any financial gain. This concept often evokes nostalgia for the early, less
> commercialized days of the web, before the ad-supported business model took over the internet
To be included on the list you have to meet certain criteria, one of which is to have an RSS/Atom feed of the content.
When I created the RSS feed for this site I searched for best practices: RSS vs Atom, which Content-Type header to
use, and so on.
So when I discovered that the list of sites is available on GitHub at [kagisearch/smallweb](https://github.com/kagisearch/smallweb/)
I wondered what conclusions everyone else came to...
## Scraping
I wanted to scrape both the HTTP headers and body for all the feed URLs. I threw together a quick bash script which ran
the curl requests in parallel.
I chose not to follow any redirects, forced all URLs to use HTTPS and set a hard timeout of 3 seconds.
Of the 14,513 URLs, 12,929 returned HTTP 200; all other responses were discarded. The data was then passed through some
gnarly grep one-liners to produce the graphs below.
A total of 3,024 MB was downloaded.
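The scrape described above could be sketched roughly as follows. This is a reconstruction under assumptions, not the script that was actually run: the `headers/`, `bodies/` and `smallweb.txt` names come from the commands later in this post, while the function names, the checksum-based file names and the job count of 32 are made up for illustration.

```bash
#!/usr/bin/env bash
# Sketch only: function names, file naming and the job count are assumptions.

# Rewrite (or add) the scheme so every request goes over HTTPS.
force_https() {
  printf '%s\n' "$1" | sed -E 's#^([a-zA-Z][a-zA-Z0-9+.-]*://)?#https://#'
}

# Fetch one feed. Headers and body go to separate files named after a
# checksum of the URL. There is no --location, so redirects are not followed.
fetch_one() {
  local url id
  url=$(force_https "$1")
  id=$(printf '%s' "$url" | cksum | cut -d' ' -f1)
  curl --silent --max-time 3 \
    --dump-header "headers/$id" --output "bodies/$id" "$url" || true
}

# Run the fetches in parallel, one URL per line of the input file.
scrape() {
  mkdir -p headers bodies
  export -f force_https fetch_one
  xargs -P 32 -I{} bash -c 'fetch_one "$1"' _ {} < "$1"
}

# Only touch the network when a URL list is passed on the command line.
if [ "$#" -ge 1 ]; then
  scrape "$1"
fi
```

Saved as a hypothetical `scrape.sh`, it would be run as `./scrape.sh smallweb.txt`.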
## RSS vs Atom
```php
//[eval]
echo new \Pierresh\Simca\Charts\BarChart(700, 300)
->setSeries([[9029, 4019]])
->setLabels(['RSS', 'Atom'])
->render();
```
```bash
grep --no-filename -EiRo '<(rss|feed)' bodies/ | sort | uniq -c | sort -rn
```
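The two bars add up to 13,048, slightly more than the 12,929 successful responses. One possible explanation is that the pattern matches more than once in some bodies, for example on a tag such as `<feedburner:info>`. A hedged alternative that classifies each body exactly once is sketched below; the `feed_format` name is hypothetical.

```bash
# Classify one feed body by the first root-like tag it contains.
# Prints "rss", "atom" or "unknown". The tag name must be followed by
# whitespace, ">" or end of line, so "<feedburner:...>" is not counted.
feed_format() {
  local tag
  tag=$(grep -Eio -m 1 '<(rss|feed)([[:space:]>]|$)' "$1" \
    | head -n 1 | tr '[:upper:]' '[:lower:]')
  case "$tag" in
    '<rss'*)  echo rss ;;
    '<feed'*) echo atom ;;
    *)        echo unknown ;;
  esac
}

# Possible aggregation over the scraped bodies:
#   for f in bodies/*; do feed_format "$f"; done | sort | uniq -c | sort -rn
```

RSS 1.0 feeds, whose root element is `<rdf:RDF>`, would land in the "unknown" bucket with this sketch.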
## Content-Type Header
Many different content types can be declared for a feed. Which is the most common?
```php
//[eval]
echo new \Pierresh\Simca\Charts\BarChart(700, 600)
->setSeries([[4375, 837, 255, 223, 24, 20, 19, 3, 2, 1]])
->setOptions([
'labelAngle' => 45,
])
->setLabels([
'application/xml',
'text/xml',
'application/rss+xml',
'application/atom+xml',
'application/octet-stream',
'text/html',
'application/x-rss+xml',
'text/plain',
'application/rdf+xml',
'binary/octet-stream',
])
->render();
```
```bash
grep --no-filename -EizR "http[0-9\/\.]+ 200" headers | tr '\0' '\n' | grep --no-filename -Eio '^Content-Type:\s[^;]+$' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -nr | head -n 10
```
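Note that the `[^;]+$` part of the pattern appears to skip any header that carries a parameter such as `; charset=utf-8`, which would explain why the bars add up to 5,759 rather than 12,929. A minimal sketch of a normalisation that keeps those headers by cutting everything from the first `;` is shown below; `media_type` is a hypothetical helper name.

```bash
# Reduce a raw Content-Type header line to its lower-cased media type,
# dropping the header name, any parameters and a trailing carriage return.
media_type() {
  printf '%s\n' "$1" \
    | tr -d '\r' \
    | tr '[:upper:]' '[:lower:]' \
    | sed -E 's/^content-type:[[:space:]]*//; s/[[:space:]]*;.*$//'
}

media_type 'Content-Type: Application/RSS+XML; charset=UTF-8'
# prints: application/rss+xml
```

Counting the output of this helper instead of the raw header lines would fold the charset-bearing headers into the same buckets.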
## Charset
Not all sites included a charset for the feed, but when they did, what was it set to?
```php
//[eval]
echo new \Pierresh\Simca\Charts\BarChart(700, 400)
->setSeries([[7146, 4, 4, 1]])
->setLabels(['utf-8', '"utf-8"', 'iso-8859-1', 'utf8'])
->render();
```
```bash
grep --no-filename -EizR "http[0-9\/\.]+ 200" headers | tr '\0' '\n' | grep -Eio '^Content-Type:.+$' | grep -Eio 'charset=[^;]+$' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -nr | head -n 10
```
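The chart keeps `"utf-8"` and `utf8` as separate buckets because the values are only lower-cased. If you wanted to merge them, a small normalisation step could look like the sketch below. Treating `utf8` as an alias for `utf-8` is my assumption, and `normalize_charset` is a hypothetical name.

```bash
# Normalise a charset value: lower-case it, drop an optional "charset="
# prefix and surrounding double quotes, and map "utf8" to "utf-8".
normalize_charset() {
  printf '%s\n' "$1" \
    | tr -d '\r' \
    | tr '[:upper:]' '[:lower:]' \
    | sed -E 's/^charset=//; s/^"(.*)"$/\1/; s/^utf8$/utf-8/'
}

normalize_charset 'charset="UTF-8"'
# prints: utf-8
```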
## Path
What is the path to the feed?
Trailing slashes were stripped before aggregation.
```php
//[eval]
echo new \Pierresh\Simca\Charts\BarChart(700, 400)
->setSeries([ [4533, 2199, 1296, 929, 914, 765, 669, 144, 140, 126, 101] ])
->setLabels([
'/feed',
'/feed.xml',
'/index.xml',
'/rss.xml',
'/feeds/posts/default',
'/rss',
'/atom.xml',
'/feed.rss',
'/blog/feed',
'/feeds/all.atom.xml',
'/blog/feed.xml',
])
->setOptions([
'labelAngle' => 45,
])
->render();
```
```bash
grep -Eoi '\.[a-z]+/.+' smallweb.txt | grep -Eio '/.+' | sed 's/\/$//' | sort | uniq -c | sort -rn | head -n 10
```
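The one-liner finds the path by looking for the first `.letters/` sequence in the URL. An alternative sketch that anchors on the scheme and host instead is shown below; `feed_path` is a hypothetical name, and it assumes every line of `smallweb.txt` is an absolute URL.

```bash
# Strip the scheme and host from a URL, then any trailing slashes,
# leaving only the path (an empty string for a bare domain).
feed_path() {
  printf '%s\n' "$1" \
    | sed -E 's|^[a-zA-Z][a-zA-Z0-9+.-]*://[^/]*||; s|/+$||'
}

feed_path 'https://example.com/blog/feed/'
# prints: /blog/feed
```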
## TLD Choice
```php
//[eval]
echo new \Pierresh\Simca\Charts\BarChart(700, 400)
->setSeries([[8136, 930, 787, 694, 511, 403, 241, 213, 140, 130]])
->setLabels(['.com', '.net', '.org', '.io', '.dev', '.me', '.blog', '.co.uk', '.de', '.xyz'])
->render();
```
```bash
grep -Eoi '\.[^/]{2,7}/' smallweb.txt | sed 's/\/$//' | sort | uniq -c | sort -rn | head -n 10
```
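The pattern above is a heuristic: it takes the first dot that is followed by two to seven characters and a slash, which is how a two-label suffix such as `.co.uk` ends up as one bucket. For comparison, a stricter sketch that takes only the last label of the host is shown below (`tld_of` is a hypothetical name). It would report `.uk` rather than `.co.uk`.

```bash
# Print the last dot-separated label of a URL's host, with a leading dot.
tld_of() {
  local host
  host=$(printf '%s\n' "$1" \
    | sed -E 's|^[a-zA-Z][a-zA-Z0-9+.-]*://||; s|[/:?#].*$||' \
    | tr '[:upper:]' '[:lower:]')
  printf '.%s\n' "${host##*.}"
}

tld_of 'https://blog.example.dev/feed.xml'
# prints: .dev
```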
## Web Server
```php
//[eval]
echo new \Pierresh\Simca\Charts\BarChart(700, 400)
->setSeries([[2585, 2500, 1599, 1244, 1008, 872, 380, 363, 326, 325]])
->setLabels(['nginx', 'cloudflare', 'github.com', 'apache', 'netlify', 'blogger-renderd', 'caddy', 'esf', 'vercel', 'openresty'])
->setOptions([
'labelAngle' => 45,
])
->render();
```
```bash
grep --no-filename -EiRo '^server:.+$' headers/ | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -nr | head -n 10
```
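Because the one-liner counts whole header lines, a versioned value such as `nginx/1.18.0` would be counted separately from plain `nginx`. A hedged sketch of a helper that reduces each header to the product name is shown below; `server_name` is a hypothetical name, and cutting at the first `/`, space or `(` is my assumption about where the product name ends.

```bash
# Reduce a raw Server header line to a lower-cased product name by
# dropping the header name and anything from the first "/", space or "(".
server_name() {
  printf '%s\n' "$1" \
    | tr -d '\r' \
    | tr '[:upper:]' '[:lower:]' \
    | sed -E 's/^server:[[:space:]]*//; s|[/ (].*$||'
}

server_name 'Server: nginx/1.18.0 (Ubuntu)'
# prints: nginx
```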
## Categories
I parsed out all the `<category>` nodes from the RSS feeds.
The first chart shows the number of distinct categories per feed, compared case-insensitively.
```php
//[eval]
echo new \Pierresh\Simca\Charts\BarChart(700, 400)
->setSeries([[209, 109, 82, 62, 46, 40, 32, 30, 20, 21, 24, 32, 7, 11, 18,]])
->setLabels(['1-10', '11-20', '21-30', '31-40', '41-50', '51-60', '61-70', '71-80', '81-90', '91-100', '101-110', '111-120', '121-130', '131-140', '141-150'])
->setOptions([
'labelAngle' => 45,
])
->render();
```
The most common categories across all feeds. Matching was case-insensitive, and each category was counted only once per feed.
```php
//[eval]
echo new \Pierresh\Simca\Charts\BarChart(700, 400)
->setSeries([[193, 193, 191, 179, 178, 163, 156, 149, 147, 146, 132, 124, 123, 122, 119]])
->setLabels([0 => 'python', 'books', 'music', 'linux', 'politics', 'history', 'programming', 'science', 'security', 'art', 'video', 'education', 'technology', 'ai', 'writing'])
->setOptions([
'labelAngle' => 45,
])
->render();
```
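The "once per feed" counting can be sketched as below, assuming the nodes in question are RSS `<category>` elements that each sit on a single line. The `feed_categories` name is hypothetical. Atom's `<category term="..."/>` form is not handled, and this is not necessarily how the numbers above were produced.

```bash
# Print the distinct, lower-cased categories of one RSS feed, one per line.
# CDATA wrappers are stripped first so both plain and CDATA values match.
feed_categories() {
  sed -E 's/<!\[CDATA\[//g; s/\]\]>//g' "$1" \
    | grep -Eio '<category[^>]*>[^<]+</category>' \
    | sed -E 's/<[^>]+>//g' \
    | tr '[:upper:]' '[:lower:]' \
    | sort -u
}

# Possible aggregation over the scraped bodies:
#   for f in bodies/*; do feed_categories "$f"; done \
#     | sort | uniq -c | sort -rn | head -n 15
```

Because each feed contributes a category at most once, the final count is the number of feeds that use it. Piping a single feed's output through `wc -l` would give the per-feed category count.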