User:Graham87/SHA-1


This page describes a method I developed in 2015 to find edits that are present in old Wikipedia databases but have since been deleted.


The problem

Since 2008, I have been trying to find edits that are present in older Wikipedia databases but were mistakenly deleted (see my page history observations subpage). A problem with this endeavour is that most of the fields in the relevant parts of the Wikipedia database, the page and revision tables, are (or have been) liable to change during the course of editing. A page's title changes whenever it is moved. Even the page_id and rev_id fields, which are primary keys, are not guaranteed to be stable; the page ID field was reset whenever a page was deleted until the release of MediaWiki 1.27, and this was also true of the revision ID field until Wikipedia was upgraded to MediaWiki 1.5 in June 2005. Neither of these fields is useful when dealing with edits that were cleared from the Wikipedia database in 2004!

My solution

The revision table has two fields that in combination will almost always have unique and reliably constant values: the rev_timestamp and rev_sha1 fields. The rev_timestamp field contains the timestamp of each edit (i.e. the time that it was made, to the nearest second) and the rev_sha1 field contains the SHA-1 value of the text of each edit.[note 1] While there have been incidents (especially in 2002) when timestamps have been corrupted, by and large they are consistent across time. It is extremely unlikely that two edits will have the same SHA-1 value unless they have the same text.
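
For illustration, here is a short Python sketch (not part of my original workflow) of how a rev_sha1 value can be recomputed from revision text. As far as I know, MediaWiki stores the hash in base 36, zero-padded to 31 characters, so the helper below mirrors that format; the function name is just mine.

import hashlib

def mediawiki_sha1(text, encoding='utf-8'):
    """Return the SHA-1 of some revision text in the base-36,
    zero-padded form that MediaWiki stores in rev_sha1."""
    digest = int(hashlib.sha1(text.encode(encoding)).hexdigest(), 16)
    chars = '0123456789abcdefghijklmnopqrstuvwxyz'
    base36 = ''
    while digest:
        digest, remainder = divmod(digest, 36)
        base36 = chars[remainder] + base36
    return base36.rjust(31, '0')

# The same text hashed under the two encodings discussed in the notes
# gives different values once it contains a non-ASCII character.
print(mediawiki_sha1('Café'))                  # UTF-8
print(mediawiki_sha1('Café', 'windows-1252'))  # pre-2005 encoding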

I used MySQL queries on copies of the January and May 2003 database dumps, which had been upgraded to work with MediaWiki 1.25.1, like this:

SELECT
  rev_timestamp,
  rev_sha1,
  page_namespace,
  page_title
FROM revision
JOIN page
ON page_id = rev_page
WHERE rev_timestamp < 20030630000000
ORDER BY rev_timestamp ASC
INTO OUTFILE 'sha1.txt';

This query was used for the January 2003 database dump; for the May 2003 dump, I used a similar query but replaced the timestamp with 20030111070503 (the final relevant timestamp in the January 2003 dump) and changed the filename to "sha2.txt".
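
As an aside, since the rows in sha1.txt are sorted by rev_timestamp, the cutoff for the second query can be read straight from the end of that file. Here is a rough sketch of that check, rather than anything I necessarily ran at the time:

# Read the final timestamp from the January 2003 output; the rows
# are sorted, so it is the first field of the last line.
last_line = None
with open('sha1.txt', encoding='utf8') as fin:
    for last_line in fin:
        pass
if last_line:
    print(last_line.split('\t', 1)[0])  # 20030111070503 in this case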

The above query does not work on more recent versions of the Wikipedia database, like the one at Wikimedia Labs, because most of the edits there were made after 2003. In this case I therefore used the following query and edited out the surplus edits manually:

SELECT
  rev_timestamp,
  rev_sha1,
  page_namespace,
  page_title
FROM revision
JOIN page
ON page_id = rev_page
ORDER BY rev_timestamp ASC
LIMIT 1500000;
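
The manual removal of the surplus rows could also be scripted. Here is a rough sketch of one way to do it, assuming the query output is a tab-separated file sorted by timestamp; the filenames and the cutoff are only examples:

def trim_to_cutoff(infile, outfile, cutoff):
    """Copy only the rows whose timestamp (the first field) is at
    or before the cutoff; the rows are assumed to be sorted."""
    with open(infile, encoding='utf8') as fin, \
         open(outfile, 'w', encoding='utf8') as fout:
        for line in fin:
            if line.split('\t', 1)[0] > cutoff:
                break
            fout.write(line)

# For example, keep everything up to the last timestamp in the
# January 2003 dump, matching the May 2003 query above.
trim_to_cutoff('labs_revision.txt', 'labs_trimmed.txt', '20030111070503')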

Here is a query similar to one I used on the Wikimedia Labs archive table, with the limit adjusted to roughly correspond to the May 2003 database dump:

SELECT
  ar_timestamp,
  ar_sha1,
  ar_namespace,
  ar_title
FROM archive
ORDER BY ar_timestamp ASC
LIMIT 52000;

In the case of Wikimedia Labs, the results of a query performed from the command line using the grid engine are written to a file automatically.

Here is some Python code that I wrote to process the resulting files.

#!python3
def process_sha1_file(filename):
    """Processes a file where each line contains a timestamp,
    SHA-1, namespace,and title value into a dictionary
    where the timestamps and SHA1 values of each edit are mapped
    to titles (which are formatted as sets), on the
    unlikely event that a timestamp/SHA-1 combination
    appears more than once."""
    global namespace
    fin=open(filename,encoding='utf8')
    dict1=dict()
    for line in fin:
        temp_list=line.strip().split(sep='\t')  # fields: timestamp, SHA-1, namespace number, title
        if temp_list[0]+temp_list[1] not in dict1:
            dict1[temp_list[0]+temp_list[1]]={namespace[temp_list[2]]+temp_list[3]}
        else:
            dict1[temp_list[0]+temp_list[1]].add(namespace[temp_list[2]]+temp_list[3])
    fin.close()
    return dict1

def subtract_dicts(dict1,dict2):
    """Returns a dictionary containing the edits that are
    in dict1 but not in dict2."""
    dict3={}
    for key in dict1.keys():
        if key not in dict2:
            dict3[key]=dict1[key]
        elif len(dict2[key]) < len(dict1[key]):
            dict3[key]=dict1[key]-dict2[key]
    return dict3

def create_dictionary_of_titles(dict):
    """Creates a dictionary with a title as the key
    and the number of edits made to that title as a value."""
    dict2={}
    for key in dict.keys():
        for title in dict[key]:
            dict2[title]=dict2.get(title,0)+1
    return dict2

#Create dictionary mapping namespace numbers to namespace names
namespace={'0':'','1':'Talk:','2':'User:',
'3':'User_talk:','4':'Wikipedia:','5':'Wikipedia_talk:',
'6':'File:','7':'File_talk:','8':'MediaWiki:',
'9':'MediaWiki_talk:','10':'Template:','11':'Template_talk:',
'12':'Help:','13':'Help_talk:','14':'Category:',
'15':'Category_talk:','100':'Portal:','101':'Portal_talk:',
'108':'Book:','109':'Book_talk:','118':'Draft:',
'119':'Draft_talk:','120':'Education_program:','447':'Education_program_talk:',
'828':'Module:','829':'Module_talk:','2600':'Topic:'}
#Create the two dictionaries, subtract them, then create another dictionary mapping titles to number of edits
dict1=process_sha1_file('sha1.txt')
dict2=process_sha1_file('sha2.txt')
dict3=subtract_dicts(dict1,dict2)
dict4=create_dictionary_of_titles(dict3)
#Sort the titles by number of edits
titles_with_values=[]
for key in dict4:
    titles_with_values.append((dict4[key],key))
titles_with_values.sort(reverse=True)
#Write the results to a file, in this case "missedit.txt"
fout=open('missedit.txt','w',encoding='utf8')
for item in titles_with_values:
    fout.write(str(item[0])+' '+item[1]+'\n')
fout.close()
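
To show how the comparison behaves, here is a toy example of the subtract_dicts function defined above (the data is made up): an edit missing entirely from the newer file is reported in full, and an edit recorded under two titles in the older file but under only one of them in the newer file is reported under the missing title.

# Toy data: each key is a timestamp concatenated with a SHA-1 value,
# and each value is the set of titles seen with that combination.
older={'20030101000000'+'k'*31:{'Foo'},
       '20030102000000'+'m'*31:{'Bar','Talk:Bar'}}
newer={'20030102000000'+'m'*31:{'Bar'}}
# Prints a dictionary with the first key mapped to {'Foo'} and the
# second key mapped to {'Talk:Bar'}.
print(subtract_dicts(older, newer))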

Notes

  1. ^ The SHA-1 value of the text differs depending on its encoding. At the time of writing this text in 2015, the English Wikipedia's early edits were encoded in Windows-1252 (nominally ISO/IEC 8859-1), while edits made after MediaWiki was upgraded to MediaWiki 1.5 in June 2005 were in UTF-8. The latter encoding is also used for any edits imported after that date, so SHA-1 values would differ between an edit in an old database and its corresponding imported edit in the modern database. They would also differ between old and new databases when an edit in the new database was deleted before the MediaWiki 1.5 upgrade and undeleted afterwards, again because of the difference in encodings. As of June 2023, all text in the English Wikipedia's database is encoded in UTF-8 (see T128151).