Checking for Stale RSS feeds in Python

So while I like RSS feeds, one issue I have noted over the years is that people and companies change the URL for the feed. This ends up being a silent error in most feed readers (I swear Google Reader had some metrics for this, but that was so long ago).

I have tried to write code to check for broken feeds at least a few different times and given up. RSS feeds are quite heterogeneous if you look closely. Aliquote has a list of feeds though, which I started going through to update my own, and I figured it would be worth another shot at filtering out old/dormant feeds when checking out new blogs.

So first we will get the blog feeds and Christophe’s tags for each blog:

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import pandas as pd
import ssl
import warnings # silence bs4 warnings about the html parser
warnings.filterwarnings("ignore", category=UserWarning, module='bs4')

# aliquote's feed list: one feed per line, URL followed by space-separated tags
feeds = r'https://aliquote.org/pub/urls~'
gcont = ssl.SSLContext() # note: a bare context does not verify certificates
head = {'User-Agent': 'Chrome/104.0.0.0'}
response = urlopen(Request(feeds,headers=head),context=gcont).read()
list_feeds = response.splitlines() # list of bytes, decoded later in the loop

I do not really know anything about SSL and headers (I have had some small nightmares recently on work machines with ZScaler, and with getting Windows Subsystem for Linux working with poetry/pip). So caveat emptor.
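
For what it is worth, a bare ssl.SSLContext() skips certificate and hostname checks by default, which is probably why it gets past picky sites and proxies. Here is a minimal sketch of making that choice explicit, using only the standard library. The helper name make_context is hypothetical, not something from the code above:

```python
import ssl

def make_context(verify=True):
    """Return an SSL context; verify=False behaves like an unverified context."""
    ctx = ssl.create_default_context()  # verifies certificates and hostnames
    if not verify:
        # Order matters: hostname checking must be off before CERT_NONE is allowed
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    return ctx

strict = make_context()               # safer default for normal use
lenient = make_context(verify=False)  # tolerant of broken or intercepted certs
print(strict.verify_mode.name, lenient.verify_mode.name)
```

Either context can be passed as the context argument to urlopen. The lenient one trades safety for fewer failed requests, which may be acceptable for a one-off check of public feeds.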

Next we have this ugly function I wrote. It could be cleaned up no doubt, but RSS feeds can fail (either the website is down, or the URL is old/bad), so I wrap everything in a try. Also most feeds have item/pubdate tags, but a few have a different structure. So all the if/elses are just to capture some of these other formats as best I can.

# Ugly Function to grab RSS last update
def getrss_stats(feed,context=gcont,header=head):
    try:
        req = Request(feed,headers=header)
        response = urlopen(req,context=context)
        soup = BeautifulSoup(response,'html.parser')
        all_items = soup.find_all("item")
        all_pub = soup.find_all("published")
        all_upd = soup.find_all("updated")
        if len(all_items) > 0:
            totn = len(all_items)
            if all_items[0].pubdate:
                last_dt = all_items[0].pubdate.text
            elif all_items[0].find("dc:date"):
                last_dt = all_items[0].find("dc:date").text
            else:
                last_dt = '1/1/1900'
        elif len(all_pub) > 0:
            totn = len(all_pub)
            last_dt = all_pub[0].text
        elif len(all_upd) > 0:
            totn = len(all_upd)
            last_dt = all_upd[0].text
        else:
            totn = 0 # means able to get response
            last_dt = '1/1/1900'
        return [feed,totn,last_dt]
    except Exception: # site down, bad URL, unparseable response, etc.
        return [feed,None,None]

This returns a list with the feed URL you put in, the total number of posts in the feed, and the (hopefully) most recent posting date. The total is partially a red herring, since people may only publish a limited number of recent blog posts to the feed. I fill in missing data from a parsed feed as 0 and ‘1/1/1900’. If the response was bad overall, you get back None values.
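
To make those three outcomes concrete, here is a small sketch that labels a returned list. The function name describe_result, the labels, and the example.com URLs are all made up for illustration:

```python
def describe_result(row, placeholder='1/1/1900'):
    """Label a [feed, total_items, last_date] list from the feed check."""
    feed, totn, last_dt = row
    if totn is None:
        return 'failed'   # the request or the parsing raised an exception
    if totn == 0:
        return 'empty'    # got a response, but none of the known tags were found
    if last_dt == placeholder:
        return 'no_date'  # items were found, but no recognizable date tag
    return 'ok'

# Hypothetical rows, one per outcome
examples = [
    ['https://example.com/a.xml', None, None],
    ['https://example.com/b.xml', 0, '1/1/1900'],
    ['https://example.com/c.xml', 10, '1/1/1900'],
    ['https://example.com/d.xml', 10, 'Sat, 20 Aug 2022 14:30:00 +0000'],
]
labels = [describe_result(r) for r in examples]
print(labels)
```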

So now I can loop over the list of our feeds. Here I only care about those with stats/python/sql (not worrying about Julia yet!):

# Looping over the list
# only care about certain tags
tg = {'stats','rstats','python','sql'}
fin_stats = []

# Takes a few minutes
for fd in list_feeds:
    fdec = fd.decode().split(" ")
    rss, tags = fdec[0], fdec[1:]
    if (tg & set(tags)):
        rss_stats = getrss_stats(rss)
        rss_stats.append(tags)
        fin_stats.append(rss_stats.copy())
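
As a check on the filtering logic without hitting the network, here is a sketch run on a few made-up lines. It assumes each line is bytes holding a URL followed by space-separated tags, which is what the loop above relies on. The sample URLs and the helper split_feed_line are hypothetical:

```python
# Made-up lines in the assumed "url tag tag ..." format
sample_lines = [
    b'https://example.com/feed.xml stats python',
    b'https://example.org/rss julia',
    b'',
    b'https://example.net/atom.xml  sql',
]
wanted = {'stats', 'rstats', 'python', 'sql'}

def split_feed_line(raw):
    """Split one raw line into (url, tags); blank lines give (None, [])."""
    parts = raw.decode().split()  # split() with no argument drops empty strings
    if not parts:
        return None, []
    return parts[0], parts[1:]

kept = []
for raw in sample_lines:
    url, tags = split_feed_line(raw)
    if url and wanted & set(tags):  # keep feeds sharing at least one tag
        kept.append((url, tags))
print(kept)
```

Using split() with no argument is a small design choice here: it tolerates blank lines and doubled spaces, where split(" ") would produce empty strings.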

These dates are a bit of a mess. But this is the quickest way I know of to clean them up, via some pandas magic:

# Convert data to dataframe
rss_dat = pd.DataFrame(fin_stats,columns=['RSS','TotItems','LastDate','Tags'])

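# Coercing everything to a nicer formatted day
# Step 1: parse the mixed date strings, unparseable values become NaT
rss_dat['LastDate'] = pd.to_datetime(rss_dat['LastDate'].fillna('1/1/1900'),errors='coerce',utc=False)
# Step 2: keep only the leading date part, dropping the time and timezone
rss_dat['LastDate'] = pd.to_datetime(rss_dat['LastDate'].astype(str).str[:11])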

rss_dat.sort_values(by='LastDate',ascending=False,inplace=True,ignore_index=True)
print(rss_dat.head(10))
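
If you would rather not lean on pandas for the parsing, here is an alternative sketch using only the standard library. It is not what the code above does. It assumes the dates are either RSS style (like 'Sat, 20 Aug 2022 14:30:00 +0000') or Atom/ISO style (like '2022-08-20T14:30:00Z'), and treats dates without a timezone as UTC. The function name parse_feed_date is hypothetical:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def parse_feed_date(text):
    """Return a timezone-aware UTC datetime, or None if the text cannot be parsed."""
    text = (text or '').strip()
    if not text:
        return None
    try:
        dt = parsedate_to_datetime(text)  # RSS style dates
    except (TypeError, ValueError):
        try:
            # Atom style dates; swap a trailing Z for an explicit offset
            dt = datetime.fromisoformat(text.replace('Z', '+00:00'))
        except ValueError:
            return None
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    return dt.astimezone(timezone.utc)

for raw in ['Sat, 20 Aug 2022 14:30:00 +0200', '2022-08-20T14:30:00Z', '1/1/1900']:
    print(raw, '->', parse_feed_date(raw))
```

Anything in another format (including the ‘1/1/1900’ placeholder) comes back as None, so real feeds would likely need more cases than these two.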

And you can see (that links to a CSV file) that most of the feeds on Christophe’s list are up to date. But some are dormant (Peter Norvig’s blog RSS is dormant, though he posts snippets/updates on his GitHub), and some have moved to different locations (such as Frank Harrell’s).

So I think having a feed reader give a note when a feed has not been updated for, say, 3 months would work for many feeds. Some people have blogs that are not as regular (which is fine), but for many sites, such as journals publishing papers, something is probably wrong if there are no updates for several months.
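
A rough sketch of what such a check might look like, assuming you already have the date of the most recent post as a timezone-aware datetime. The function name is_stale and the 90 day default are illustrative choices, and a missing date is treated as stale:

```python
from datetime import datetime, timedelta, timezone

def is_stale(last_post, now=None, max_age_days=90):
    """True when the newest post is older than max_age_days, or the date is unknown."""
    if last_post is None:
        return True  # assumption: a feed with no usable date counts as stale
    now = now or datetime.now(timezone.utc)
    return (now - last_post) > timedelta(days=max_age_days)

# Fixed reference date so the example is reproducible
today = datetime(2022, 9, 1, tzinfo=timezone.utc)
recent = datetime(2022, 8, 20, tzinfo=timezone.utc)
old = datetime(2021, 12, 1, tzinfo=timezone.utc)
print(is_stale(recent, now=today), is_stale(old, now=today))
```

The threshold could be set per feed, with a longer window for blogs that post irregularly and a shorter one for journals.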
