User:Ryos/Watchlist RSS feeder

This code is a modified version of Sylvain Schmitz's PHP RSS feed. His code and my changes are both released under the GFDL.

I wanted an RSS feed of my watchlist almost immediately after joining Wikipedia. Unfortunately, the Wikipedia software doesn't support that feature. Many Wikipedians seem to have been hacking away at the problem, but none of the solutions I found worked for me. Sylvain Schmitz's was close, but I didn't want to have to post it to a (public) webserver (and I don't speak French). So I started hacking away at Sylvain's code to see if I could make it do what I wanted. This is the result; may you find it useful.

What you'll need to make it work

  1. An RSS reader that has the ability to subscribe to scripts on the local machine. The only reader I know of that can do this is NetNewsWire, and it's Mac only.
  2. An installation of PHP 4.3.0 or later with the cURL extension (the script uses it to talk to Wikipedia), properly configured to run command-line PHP scripts. Mac OS X comes so configured out of the box; I don't know how to do it on Windows (see the PHP manual).
  3. There is no step 3.
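
If you're not sure whether your PHP is up to the job, something like the following can tell you. This is only a sketch: missing_requirements is a name I made up for illustration, and it checks just the two things the script visibly depends on (the PHP version and the curl_* functions).

```php
<?php
// Hypothetical helper, not part of the feed script: returns a list of
// human-readable problems, or an empty array if everything looks fine.
function missing_requirements ($php_version, $has_curl)
  {
    $problems = array ();
    if (version_compare ($php_version, '4.3.0', '<'))
      $problems[] = "PHP 4.3.0 or later is required (found $php_version)";
    if (!$has_curl)
      $problems[] = 'the cURL extension is not loaded';
    return $problems;
  }

// Check the PHP that is running this very file
$problems = missing_requirements (PHP_VERSION, extension_loaded ('curl'));
if (count ($problems) == 0)
  echo "Looks OK\n";
else
  echo "Problems:\n - " . implode ("\n - ", $problems) . "\n";
```

Save it to a file and run it from the terminal with the same php binary you intend to use for the feed script.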

How to make it work

  1. Copy the code below to a text file and name it with a .php extension. When saving, keep in mind that the script saves a cookie file in the same directory.
  2. Configure the script. Set the following variables below:
    1. $wp_name = 'yourname';
    2. $wp_password = 'yourpass';
    3. $wp_tmz = 'your timezone offset from GMT'; //ex: -06:00 (that's my zone)
    4. $script_name = "the name you gave the script.php"; //This is used to name the cookie
  3. Make the script executable. To do this, open the terminal and type "chmod +x " (without the quotes). Then, drag the script file to the terminal window and press return.
  4. Subscribe to the script in NetNewsWire. Make sure to change the script type from "Applescript" to "Shell Script".

Known issues

  1. As of this writing, the parsing is somewhat broken, no doubt because the watchlist page structure has changed since the script was written. The script still mostly works; it just occasionally creates items with bad links and/or the wrong information. Unfortunately, I'm pretty busy these days and not very motivated to fix this script. If you fix it, please post the updated version. Sorry, and thanks. ryos 07:35, 6 November 2007 (UTC)
  2. Linking to subpages (this page you're reading is a subpage) is broken due to the / character being urlencoded to %2F
  3. It probably won't work for non-English wikis. I say this because I had to make some changes to the parsing code from Sylvain Schmitz's French-localized version in order to get it to work with en.wikipedia.org.
  4. I've found that NetNewsWire does not pick up edits to articles that have appeared in the feed before and remain cached by the application. There are two ways to work around this:
    1. You could enable the NNW preference to highlight differences to updated articles, or
    2. You could get info on your subscription, set it to use special persistence settings, and set items to disappear from the feed after 0 days. This is my preferred option.
I suppose that it may be possible to modify the feed such that these workarounds are not necessary, but frankly I know nothing about RSS feeds; I was able to modify Sylvain's code, but that's the extent of it.
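
For the subpage issue (number 2 above), one possible fix is to put the slashes back after urlencode has done its work. The sketch below is untested against the live script; wiki_url_title is a hypothetical name, and the idea would be to call it wherever the script currently does urlencode(str_replace (" ", "_", $links[$i])).

```php
<?php
// Hypothetical replacement for the urlencode(str_replace(...)) calls:
// encodes a page title for use in a /wiki/ URL but leaves the "/" of
// subpages alone.
function wiki_url_title ($title)
  {
    $encoded = urlencode (str_replace (" ", "_", $title));
    // urlencode turns "/" into "%2F"; turn it back
    return str_replace ("%2F", "/", $encoded);
  }

echo wiki_url_title ("User:Ryos/Watchlist RSS feeder") . "\n";
```

Note that the colon is still encoded as %3A, exactly as the script does today; only the slash handling changes.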

Wishlist

If you have the time, ability, and inclination to add any of these (or an idea of your own), please do. It's open source, after all.

  1. Configuration via command-line arguments. This is actually implemented, but commented out because I couldn't get it to work with NetNewsWire, and I don't know why.
  2. Automatically get the timezone with an AppleScript wrapper. Not sure if possible, but it should be.
  3. Ability to grab watchlists from more than one Wikimedia project, and built-in localizations to them all. Perhaps just the localized versions will do, since any newsreader worth its salt can aggregate multiple subscriptions into one.
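
For item 2, PHP itself may be enough, with no AppleScript needed. The sketch below builds a string in the format $wp_tmz expects from an offset in seconds. format_tz_offset is a made-up name, and the whole thing assumes your computer's timezone matches the one Wikipedia uses when it shows you your watchlist; if it doesn't, keep setting $wp_tmz by hand.

```php
<?php
// Hypothetical helper: turns an offset from GMT in seconds into the
// "+HH:MM" / "-HH:MM" form used by $wp_tmz.
function format_tz_offset ($seconds)
  {
    $sign    = ($seconds < 0) ? '-' : '+';
    $abs     = abs ($seconds);
    $hours   = (int) floor ($abs / 3600);
    $minutes = (int) floor (($abs % 3600) / 60);
    return sprintf ('%s%02d:%02d', $sign, $hours, $minutes);
  }

// date ('Z') is the local timezone offset in seconds
$wp_tmz = format_tz_offset ((int) date ('Z'));
echo $wp_tmz . "\n";
```

For example, format_tz_offset (-21600) gives "-06:00".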

(shut up and show us) The code

#!/usr/bin/php
<?php
/* Todo
	-Determine if anonymous edits could be supported without $entries; this is
	 currently the array's only purpose
	
	-Make configurable
		-Look into getting timezone w/applescript
		-Look into setting username/pass w/NNW params
			-NNW appears to futz the params. D'oh!
	
*/
/*******************************************************************************
watchlistrss - a script that produces a feed of your watchlist.
Based on Sylvain Schmitz's script: http://meta.wikimedia.org/wiki/User:Sylvain_Schmitz/Watchlist_RSS_feed_in_PHP
Modified by Ryan Ballantyne (Ryos)

My changes from the original include:
	-Make runnable from the PHP command-line SAPI
	-Localize to English Wikipedia
	-Change the RSS feed information to a format I find more useful
	-Change output method to play nice with NetNewsWire

What it's for:
	This script reads your watchlist from wikipedia and transforms it into an
	RSS feed that can be read by a newsreader that has the ability to subscribe
	to scripts on the local machine. The only reader I know of with this ability
	is NetNewsWire on the Mac.

How to use it:
	1) Copy this code to a text file and name it with a .php extension.
	   When saving, keep in mind that the script saves a cookie file in the same
	   directory.
	2) Configure the script. Set the following variables below:
		$wp_name = 'yourname';
		$wp_password = 'yourpass';
		$wp_tmz = 'your timezone offset from GMT';  //ex: -06:00 (that's my zone)
		$script_name = "the name you gave the script.php";  //This is used to name the cookie
	3) Make the script executable. To do this, open the terminal and type:
	   chmod +x 
	   Then, drag the script file to the terminal window and press return.
	4) Subscribe to the script in NetNewsWire. Make sure to change the script type
	   from "Applescript" to "Shell Script".
	5) You are now one hoopy frood. Enjoy.
	
Known bugs/issues:
	-Linking to subpages is broken due to the / character being urlencoded to %2F
*******************************************************************************/

  /****************************************************************** Setup. */
  $printDebug = false;

  // time zone on the server; default to GMT
  /*$wp_tmz = "+00:00";
  
  // Parse the options
  $script_name = $argv[0];
  
  for ($i = 1; $i < count ($argv); $i++)  {
      switch ($argv[$i])  {
        case "-u":
        case "--user":
          $wp_name = $argv[$i+1];
          $i++;
          break;
        case "-p":
        case "--pass":
          $wp_password = $argv[$i+1];
          $i++;
          break;
        case "-t":
        case "--timezone":
          $wp_tmz = $argv[$i+1];
          $i++;
          break;
        case "-d":
          $printDebug = true;
          break;
      }
  }
  if (empty ($wp_name) || empty ($wp_password))  {
  	exit ("\nUsage: [-u|--user] username [-p|--pass] password\n\n");
  }*/
  
  $wp_name = 'yourusername';
  $wp_password = 'yourpassword';
  $wp_tmz = "+00:00";
  $script_name = 'watchlistrss.php';
  
  //Set error reporting based on if we're debugging
  if (!$printDebug)  { ini_set ('display_errors', '0'); }
  else  { ini_set ('display_errors', '1'); }

  // default domain and path
  $wp_domain    = 'en.wikipedia.org';
  $wp_watchlist = '/wiki/Special:Watchlist';

  // maximum number of entries in the feed
  $max_entries = 20;

  // localized array for month names
  $months = array ("January" => "01", "February" => "02", "March" => "03",
                   "April" => "04", "May" => "05", "June" => "06",
                   "July" => "07", "August" => "08", "September" => "09",
                   "October" => "10", "November" => "11", "December" => "12");

  // localized user pages prefix
  $wp_userpage = "User:";

  // localized title
  $wp_title = "Watchlist";

  // localized description
  $wp_description = "$wp_name's $wp_title";

  /*********************************************************** End of setup. */

  // name of the cookie file
  $cookie_file  = $script_name .'_'. $wp_domain .'_cookie';

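  // get the expiration time from the cookie
  // (cURL cookie jars use the Netscape format: one cookie per line,
  // tab-separated, with the expiry timestamp in the fifth field)
  $time = 0;
  // on the first run there is no cookie file yet; don't try to open it
  $cookie_fp = file_exists ($cookie_file) ? fopen ($cookie_file, "r") : false;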
  if ($cookie_fp)
    {
      while (!feof ($cookie_fp))
        {
          $cookie = fgets ($cookie_fp, 4096);
          if (strpos ($cookie, "wikiUserID") !== FALSE)
            {
              $ce = explode ("\t", $cookie); 
              $time = $ce[4];
              break;
            }
        }
      fclose ($cookie_fp);
    }

  // check whether a new login is needed
  if (($time - 60) < time ())
    {
      // login URL
      $wp_login = '/w/index.php?title=Special:Userlogin'
        .'&action=submitlogin&type=login';

      // login connection
      $login = curl_init ();

      $postdata = array ();
      $postdata['wpName']         = $wp_name;
      $postdata['wpPassword']     = $wp_password;
      $postdata['wpRemember']     = '1';
      $postdata['wpLoginattempt'] = 'true';
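      $post = '';
      foreach ($postdata as $key=>$value)
        if ($key && $value) 
          $post .= $key."=".urlencode($value)."&";

      // CURLOPT_MUTE no longer exists in current cURL/PHP builds;
      // CURLOPT_RETURNTRANSFER likewise keeps the login response off stdout
      curl_setopt ($login, CURLOPT_RETURNTRANSFER, TRUE);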
      curl_setopt ($login, CURLOPT_POST,         TRUE); 
      curl_setopt ($login, CURLOPT_POSTFIELDS,   $post);
      curl_setopt ($login, CURLOPT_COOKIEJAR,    $cookie_file);
      curl_setopt ($login, CURLOPT_URL,          $wp_domain.$wp_login);
      curl_exec   ($login);
      curl_close  ($login);
    }

  // grab the contents
  $content = curl_init ();
  curl_setopt ($content, CURLOPT_RETURNTRANSFER, TRUE);
  curl_setopt ($content, CURLOPT_COOKIEFILE,     $cookie_file);
  curl_setopt ($content, CURLOPT_COOKIEJAR,      $cookie_file);
  curl_setopt ($content, CURLOPT_URL,            $wp_domain.$wp_watchlist);
  $watchlist = curl_exec ($content);
  curl_close  ($content);

  // function for ISO8601 time and date
  function to_iso8601 ($date_str)
    {
      global $months;
      $date_fields = explode (" ", $date_str);
      $day = $date_fields[0];
      if (strlen ($day) == 1)
        $day = "0".$day;
      $month = $date_fields[1];
      $year = $date_fields[2];
      return $year."-".$months[$month]."-".$day."T";
    }
    
   
  // explode the contents by days
  define ('LENGTH_TIMESTR', 5);
  define ('ANON_TITLETEXT', 'Special:Contributions');
  
  $days     = explode ("<h4>", $watchlist);
  $links    = array();
  $titles   = array();
  $descriptions = array();
  $entries  = array();
  $times    = array();
  $authors  = array();
  $nentries = 0;
  for ($i = 1; $i < sizeof ($days) && $nentries < $max_entries; $i++)
    {
      $the_date = to_iso8601 (substr ($days[$i], 0,
                                      strpos ($days[$i], "</h4>")));
                                      
      $lines = explode ("<br />", $days[$i]);
      $tmp = explode (" . . ", $days[$i]);
      
      //debug
      if ($printDebug)  {
		  echo "\$lines $i:";
		  echo "\n"; print_r ($lines); echo "\n\n";
		  echo "tmp $i:";
		  echo "\n"; print_r ($tmp); echo "\n\n";
	  }
      
      for ($j = 0; $j < sizeof ($tmp)-1 && $nentries < $max_entries; $j++)
        {
          //links
          // 15 = strlen ('<a href="/wiki/'): skip ahead to the page name
          $offset = strpos ($lines[$j], '<a href="') + 15;
          $links[$nentries] = substr ($lines[$j], $offset,
                                      strpos (substr ($lines[$j], $offset), '"'));
          
          //descriptions
          $offset = strpos ($lines[$j], '<tt>');
          $descriptions[$nentries] = substr ($lines[$j], $offset);
          
          
          //entries
          $offset = strpos ($tmp[$j+1], ' title="') + 8;
          $entries[$nentries] = substr ($tmp[$j+1], $offset,
                                  strpos (substr ($tmp[$j+1], $offset), '"'));
          
          //times
          $offset = strpos ($tmp[$j], '; ') + 2;
          $times[$nentries] = $the_date.substr ($tmp[$j], $offset, LENGTH_TIMESTR).$wp_tmz;
          
          //authors
          //Anonymous edits result in different output; we must treat it specially
          if ($entries[$nentries] != ANON_TITLETEXT)  {
              $offset = strpos ($tmp[$j+1], ' title="'.$wp_userpage)
                        + 8 + strlen ($wp_userpage);
              $authors[$nentries] = substr ($tmp[$j+1], $offset,
                                      strpos (substr ($tmp[$j+1], $offset), '"'));
          }
          else  {
              $offset = strpos ($tmp[$j+1], ' title="'.ANON_TITLETEXT)
                        + 8 + strlen (ANON_TITLETEXT) + 2;
              $authors[$nentries] = substr ($tmp[$j+1], $offset,
                                      strpos (substr ($tmp[$j+1], $offset), '<'));
          }
          
          //titles
          $offset = strpos ($lines[$j], ' title="') + 8;
          $titles[$nentries] = substr ($lines[$j], $offset,
                                  strpos (substr ($lines[$j], $offset), '"'));
          $titles[$nentries] .= ' . . '. $authors[$nentries];
          
          $nentries++;
        }
    }
    
    //debug
    if ($printDebug)  {
		echo "links:\n"; print_r ($links);
		echo "titles:\n"; print_r ($titles);
		echo "descriptions:\n"; print_r ($descriptions);
		echo "entries:\n"; print_r ($entries);
		echo "times:\n"; print_r ($times);
		echo "authors:\n"; print_r ($authors);
	}

  /********************************************************* RSS generation. */

  $disallowed_xml = array ("&", "<", ">");
  $replacements_xml = array ("&amp;", "&lt;", "&gt;");
  
  $output = '';

  // header
  $output .= "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n";
  $output .= "<!DOCTYPE rdf:RDF [\n";
  $output .= "<!ENTITY % HTMLlat1 PUBLIC\n";
  $output .= " \"-//W3C//ENTITIES Latin 1 for XHTML//EN\"\n";
  $output .= " \"http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent\">\n";
  $output .= "]>\n";
  $output .= "<rdf:RDF\n";
  $output .= "  xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\" \n";
  $output .= "  xmlns:sy=\"http://purl.org/rss/1.0/modules/syndication/\"\n";
  $output .= "  xmlns:dc=\"http://purl.org/dc/elements/1.1/\"\n";
//$output .= "  xmlns:content=\"http://purl.org/rss/1.0/modules/content/\"\n";
  $output .= "  xmlns=\"http://purl.org/rss/1.0/\"\n";
  $output .= ">\n";
      
  // channel summary
  $output .= "  <channel rdf:about=\"http://"
         .$wp_domain.str_replace ($disallowed_xml,
                                  $replacements_xml,
                                  $wp_watchlist)."\">\n";
  $output .= "    <title>$wp_title</title>\n";
  $output .= "    <link>http://"
         .$wp_domain.str_replace ($disallowed_xml,
                                  $replacements_xml,
                                  $wp_watchlist)."</link>\n";
  $output .= "    <description>$wp_description</description>\n";
  $output .= "    <dc:source>http://"
         .$wp_domain.str_replace ($disallowed_xml,
                                  $replacements_xml,
                                  $wp_watchlist)."</dc:source>\n";
  $output .= "    <dc:date>".date("Y-m-d\TH:iO")."</dc:date>\n";
  $output .= "    <sy:updatePeriod>hourly</sy:updatePeriod>\n";
  $output .= "    <sy:updateFrequency>4</sy:updateFrequency>\n";
  $output .= "    <sy:updateBase>1970-01-01T00:00+00:00</sy:updateBase>\n";
  $output .= "    <items>\n";
  $output .= "      <rdf:Seq>\n";
  for ($i = 0; $i < $nentries; $i++)
    {
      $output .= "        <rdf:li resource=\"http://$wp_domain/wiki/"
             .urlencode(str_replace (" ", "_", $links[$i]))."\" />\n";
    }
  $output .= "      </rdf:Seq>\n";
  $output .= "    </items>\n";
  $output .= "\n";
  $output .= "  </channel>\n";

  // items
  for ($i = 0; $i < $nentries; $i++)
    {
      $output .= "  <item rdf:about=\"http://$wp_domain/wiki/"
             .urlencode(str_replace (" ", "_", $links[$i]))."\">\n";
      $output .= "    <title>".$titles[$i]."</title>\n";
      $output .= "    <description>{$descriptions[$i]}</description>\n";
      $output .= "    <dc:creator>".$authors[$i]."</dc:creator>\n";
      $output .= "    <dc:date>".$times[$i]."</dc:date>\n";
      $output .= "  </item>\n\n";
    }

  // footer
  $output .= "</rdf:RDF>\n";
  
  exit ($output);
?>
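
A note for anyone who wants to fix the parsing: the script pulls every field out of the page with the same strpos-plus-offset trick, and it never checks whether strpos actually found anything. When the marker is missing, strpos returns false, the offset arithmetic carries on regardless, and you get garbage. That may be where some of the bad items come from, though I haven't verified it. Below is a sketch of a helper that does the same extraction with the checks added. The name extract_after and the sample line are made up; the sample is only loosely modelled on watchlist markup.

```php
<?php
// Hypothetical helper, not part of the original script: returns the text
// that follows $marker in $haystack, up to (not including) $terminator.
// Returns '' when either the marker or the terminator is missing.
function extract_after ($haystack, $marker, $terminator)
  {
    $start = strpos ($haystack, $marker);
    if ($start === false)
      return '';
    $start += strlen ($marker);
    $end = strpos ($haystack, $terminator, $start);
    if ($end === false)
      return '';
    return substr ($haystack, $start, $end - $start);
  }

// Made-up sample line
$line  = '<a href="/wiki/Main_Page" title="Main Page">Main Page</a>';
$link  = extract_after ($line, '<a href="/wiki/', '"');
$title = extract_after ($line, ' title="', '"');
echo "$link | $title\n";
```

An empty return value is the cue to skip the entry rather than put it in the feed.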