Scraping a jazz events calendar

As I mentioned in my last post on building a live music calendar, I’m disappointed that the websites listing jazz events in Boston don’t offer their data as an RSS or iCal feed. One example is the WGBH Jazz Calendar, which probably has the most comprehensive listing of jazz events in the Boston area.

In my talk about Plone4Artists at EuroPython 2005, I mentioned a tool called Scrape ‘n’ Feed, which will scrape a website and generate an RSS feed. Well, it’s been a year since I first discovered this tool, and now I’m revisiting it to see if I can make it work. Here is my first foray into this scraping business.

ScrapeNFeed depends on Beautiful Soup and PyRSS2Gen, both of which are easy to install on Ubuntu Linux with:

apt-get install python-pyrss2gen
apt-get install python-beautifulsoup

Once I had installed these two packages, I downloaded the ScrapeNFeed.py script and created the following file, ‘getwgbhfeeds.py’:

#!/usr/bin/env python
import BeautifulSoup
from PyRSS2Gen import RSSItem, Guid
import ScrapeNFeed

class WGBHFeed(ScrapeNFeed.ScrapedFeed):    

    def HTML2RSS(self, headers, body):
        soup = BeautifulSoup.BeautifulSoup(body)
        eventTable = soup.firstText('Sort By:').findParent('table')
        tds = eventTable.fetch('td',{'class':['searchres', 'searchres_off']})
        items = []
        for item in tds:
            link = item.findNext('b')
            eventLink = self.baseURL + link.a['href']
            if not self.hasSeen(eventLink):
                eventTitle = item.a.string
                eventDate = item.contents[0].strip()
                eventLocation = item.contents[5].strip()
                items.append(RSSItem(title=eventTitle + ' (' + eventDate + ')',
                                     description=eventLocation,
                                     link=eventLink))
        self.addRSSItems(items)

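WGBHFeed.load("WGBH Concerts",
              # The URL is split into two adjacent string literals, which
              # Python joins into one; a raw line break inside the quotes
              # would be a syntax error.
              'http://www.publicbroadcasting.net/wgbh/events.eventsmain?'
              'action=showCategoryListing&newSearch=true&categorySearch=5596',
              "See all the jazz concerts posted to the WGBH calendar",
              'wgbh.xml',
              'wgbh.pickle',
              managingEditor='name@domain.com (First Last)')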

Run the script with ./getwgbhfeeds.py and it will write a file, wgbh.xml, in RSS 2.0 format. You can then open this file in your RSS reader of choice and view all the Boston jazz events.
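If you would rather check the output without an RSS reader, here is a minimal sketch that lists the items in an RSS 2.0 document using only the Python 3 standard library. The list_items helper and the sample feed are my own illustration: the item in SAMPLE_RSS is made up and simply stands in for whatever wgbh.xml actually contains.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written RSS 2.0 document standing in for wgbh.xml.
# The item below is invented for illustration; it is not real calendar data.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>WGBH Concerts</title>
    <link>http://example.com/</link>
    <description>See all the jazz concerts posted to the WGBH calendar</description>
    <item>
      <title>Example Quartet (Jan 1)</title>
      <link>http://example.com/event/1</link>
      <description>Example Club, Boston</description>
    </item>
  </channel>
</rss>"""


def list_items(rss_text):
    """Return a (title, link, description) tuple for every <item> element."""
    root = ET.fromstring(rss_text)
    rows = []
    for item in root.iter("item"):
        rows.append((item.findtext("title", default=""),
                     item.findtext("link", default=""),
                     item.findtext("description", default="")))
    return rows


if __name__ == "__main__":
    for title, link, description in list_items(SAMPLE_RSS):
        print(title, "|", description, "|", link)
```

To try it on the real file, something like list_items(open('wgbh.xml', 'rb').read()) should work, since ET.fromstring also accepts bytes.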

One thing I noticed is that some of the items in the list have an extra <br />, which means the title doesn’t get read in correctly. I’ll have to find a way to ignore the <br />, which I’m sure will be fairly simple with BeautifulSoup.
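One way to sidestep the stray <br /> is to stop relying on fixed positions in the cell and look only at its text chunks, in order. The sketch below shows the idea with the standard library’s html.parser (Python 3) rather than BeautifulSoup, so it runs without extra packages. The two sample cells are my guess at the shape of the markup, not copied from the WGBH page, and text_chunks is a hypothetical helper name.

```python
from html.parser import HTMLParser


class TextChunks(HTMLParser):
    """Collect the non-empty text chunks of an HTML fragment, skipping all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Tags such as <br /> never reach this method; only text does.
        text = data.strip()
        if text:
            self.chunks.append(text)


def text_chunks(html):
    """Return the stripped text pieces of an HTML fragment in document order."""
    parser = TextChunks()
    parser.feed(html)
    parser.close()  # flush any text still buffered at the end of the fragment
    return parser.chunks


# Hypothetical cells: the second one carries an extra <br />.
NORMAL_CELL = 'Jan 1<br /><b><a href="/e/1">Example Quartet</a></b><br />Example Club'
EXTRA_BR_CELL = 'Jan 1<br /><br /><b><a href="/e/1">Example Quartet</a></b><br />Example Club'

if __name__ == "__main__":
    print(text_chunks(NORMAL_CELL))
    print(text_chunks(EXTRA_BR_CELL))
```

Both cells yield the same date, title and location chunks, because the extra tag contributes no text. The same approach should carry over to BeautifulSoup by keeping only the string children of each cell, though I haven’t tried that against the real page.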

What’s next

At OPMLCamp a few weeks ago, I met Mike Kowalchik, the creator of grazr. After seeing this tool, I immediately thought about how useful it would be for generating a browseable directory of event listings. You simply supply grazr with an OPML file, and it displays all the RSS feeds and their entries. After I get a couple more event listing sites scraped, I’ll generate the OPML file and try them out with grazr.
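Generating that OPML file could look roughly like the sketch below, which uses the Python 3 standard library to write one outline element per feed. The build_opml helper and the example.com feed URL are placeholders of mine, and I am assuming grazr accepts a plain OPML 2.0 document with type="rss" outlines.

```python
import xml.etree.ElementTree as ET


def build_opml(title, feeds):
    """Build an OPML 2.0 document as a string.

    feeds is a list of (feed title, feed URL) pairs; each pair becomes
    one <outline type="rss"> element in the body.
    """
    opml = ET.Element("opml", {"version": "2.0"})
    head = ET.SubElement(opml, "head")
    ET.SubElement(head, "title").text = title
    body = ET.SubElement(opml, "body")
    for feed_title, feed_url in feeds:
        ET.SubElement(body, "outline", {
            "type": "rss",
            "text": feed_title,
            "xmlUrl": feed_url,
        })
    return ET.tostring(opml, encoding="unicode")


if __name__ == "__main__":
    # Placeholder URL: wherever the scraped wgbh.xml ends up being hosted.
    print(build_opml("Boston jazz events",
                     [("WGBH Concerts", "http://example.com/wgbh.xml")]))
```

Each additional scraped site would just be another (title, URL) pair in the list.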

On his blog, Mike also mentions Tom Morris’ idea of using grazr to ‘kill myspace’ by creating a better way for independent bands and artists to promote themselves using OPML. Note to self: follow up with Tom to discuss this idea further. I love the integrated MP3 player in his grazr box. Update: left him an Odeo message.


1 Response to “Scraping a jazz events calendar”


  1. Marc

    I’ve used WWW::Mechanize in the past and it worked nicely. I used the Perl module but there is definitely a Ruby module as well, and perhaps a Python binding somewhere as well.
