Wikipedia:WikiProject Disambiguation/Database dump analysis


A database dump is a downloadable backup of all Wikipedia pages. Once downloaded, the dump can be analysed extensively offline; the same analysis cannot be done by scraping the live servers because that would create excessive load.

Database dump analysis can help WikiProject Disambiguation achieve its goals by providing editors with extra information.
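
As a rough illustration of the offline analysis described above, the sketch below streams pages out of a MediaWiki XML export one at a time. It assumes the usual export layout (a `page` element containing `title` and `revision/text`); the XML namespace URI differs between dump versions, so the tags are compared with the namespace stripped. The sample data and the helper names `local` and `iter_pages` are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the XML namespace, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) for each <page> in a MediaWiki XML export."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title, text = None, ""
        for child in elem.iter():
            name = local(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text or ""
        yield title, text
        elem.clear()  # release the page's subtree once it has been processed


# Tiny in-memory stand-in for a dump file (hypothetical content).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Mercury</title><ns>0</ns>
    <revision><text>'''Mercury''' may refer to: {{disambig}}</text></revision>
  </page>
  <page><title>Venus</title><ns>0</ns>
    <revision><text>Venus is a planet.</text></revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
```

For a real dump, which is normally distributed compressed, the file object returned by `bz2.open(path, "rb")` could be passed to `iter_pages` in place of the in-memory sample.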

Currently run dump analyses

date         total             articles          categories     portals        templates
             pages   links     pages   links     pages  links   pages  links   pages  links
2005-11-13   33102   412194    32166   410987    936 pages, 1207 links outside articles (not broken down)
2005-12-13   34475   425520    34126   425120    349 pages, 400 links outside articles (not broken down)
2006-03-03   39928   465726    38238   463507    578    836     429    495     683    888
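
As a sanity check on the figures above, the sketch below confirms that in every row the first pages/links pair equals the sum of the remaining pairs, which is what one would expect if the first pair is a total across namespaces. The function name is hypothetical; the numbers are copied from the table.

```python
# Each row is a list of (pages, links) pairs in the order they appear above.
ROWS = {
    "2005-11-13": [(33102, 412194), (32166, 410987), (936, 1207)],
    "2005-12-13": [(34475, 425520), (34126, 425120), (349, 400)],
    "2006-03-03": [(39928, 465726), (38238, 463507), (578, 836), (429, 495), (683, 888)],
}


def first_pair_is_total(pairs):
    """True if the first (pages, links) pair equals the sum of the others."""
    first, rest = pairs[0], pairs[1:]
    return first == (sum(p for p, _ in rest), sum(l for _, l in rest))


checks = {date: first_pair_is_total(pairs) for date, pairs in ROWS.items()}
```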

Proposal: tracking down dab pages with suspect style

At WP:DAB, wangi expressed interest in using the dumps to improve dab page style by tracking down suspect dab pages. One could argue that Category:Disambiguation pages in need of cleanup is always well stocked, so a dump analysis to find more troublesome dabs is unnecessary. Then again, who could have foreseen the activity around From templates that led to the completion of that report?

Ideas

Image and template checks...

Dab pages are checked for:

  • Images
  • Templates other than dab templates (stub templates, for example)
Images and templates indicate that a dab page is verging on article status. An expert can then examine the dab and merge content, start a discussion, and so on.
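
A minimal sketch of the image and template check, assuming the page's wikitext is already available as a string. The set of dab template names is illustrative only (the real list is longer and has changed over time), and the regular expressions are a simplification that does not try to handle every wikitext construct, such as parser functions.

```python
import re

# Assumption: template names that mark a page as a dab page (illustrative).
DAB_TEMPLATES = {"disambig", "disambiguation", "dab"}

# Assumption: images are embedded as [[Image:...]] or [[File:...]].
IMAGE_RE = re.compile(r"\[\[\s*(?:Image|File)\s*:", re.IGNORECASE)
# Capture the template name: the text after "{{" up to "|" or "}}".
TEMPLATE_RE = re.compile(r"\{\{\s*([^|{}]+?)\s*(?:\||\}\})")


def suspect_features(wikitext):
    """Return (image_count, non_dab_template_names) for one dab page."""
    images = len(IMAGE_RE.findall(wikitext))
    names = [m.group(1) for m in TEMPLATE_RE.finditer(wikitext)]
    extra = [n for n in names if n.lower() not in DAB_TEMPLATES]
    return images, extra


# Hypothetical dab page with one image and one stub template.
page = """'''Mercury''' may refer to:
[[Image:Mercury.jpg|thumb|The planet]]
* [[Mercury (planet)]]
* [[Mercury (element)]]
{{chem-stub}}
{{disambig}}
"""
result = suspect_features(page)
```

A page would be listed for review when either the image count is non-zero or the list of extra templates is non-empty.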
Talk page is a redirect?
  • If a page has a dab template then it should have its own talk page. Because of page moves, a dab's talk page often redirects elsewhere; no such redirect should be present. A listing of dab pages without their own talk pages would be helpful.
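
The talk page check could be sketched as follows. It assumes a hypothetical `pages` mapping from full page title to wikitext (built from a dump, for example), assumes the dabs live in the article namespace so that the talk page of "X" is "Talk:X", and treats any text beginning with `#REDIRECT` as a redirect. Missing talk pages are reported separately from redirected ones, since the proposal does not say whether they should be listed.

```python
import re

REDIRECT_RE = re.compile(r"^\s*#REDIRECT\b", re.IGNORECASE)


def talk_status(title, pages):
    """Classify the talk page of `title` as 'own', 'redirect' or 'missing'."""
    talk = pages.get("Talk:" + title)
    if talk is None:
        return "missing"
    if REDIRECT_RE.match(talk):
        return "redirect"
    return "own"


# Hypothetical data: three dab pages and two talk pages.
pages = {
    "Mercury": "'''Mercury''' may refer to: {{disambig}}",
    "Talk:Mercury": "#REDIRECT [[Talk:Mercury (planet)]]",
    "Venus (disambiguation)": "'''Venus''' may refer to: {{disambig}}",
    "Talk:Venus (disambiguation)": "Discussion about the page.",
    "Mars (disambiguation)": "'''Mars''' may refer to: {{disambig}}",
}
dabs = ["Mercury", "Venus (disambiguation)", "Mars (disambiguation)"]

redirected = [t for t in dabs if talk_status(t, pages) == "redirect"]
```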
Link checking...
  • Check the ratio of wikilinks to the number of lines on the page. The idea is that, generally, the higher the ratio, the more the page needs cleanup.
  • Check for piped links. Piping should generally not be present on dab pages. Perhaps check the total number of piped links.
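
Both link checks can be sketched with a single regular expression over the wikitext. Counting only non-blank lines is an assumption, as is the choice of field names; the pattern is a simplification and would also count image or category links as wikilinks unless they are filtered out first.

```python
import re

# Matches [[target]] and [[target|label]]; group 2 is non-empty only for piped links.
WIKILINK_RE = re.compile(r"\[\[([^\[\]|]+)(\|[^\[\]]*)?\]\]")


def link_stats(wikitext):
    """Count wikilinks, piped links and links per non-blank line."""
    lines = [ln for ln in wikitext.splitlines() if ln.strip()]
    links = WIKILINK_RE.findall(wikitext)
    piped = sum(1 for _target, label in links if label)
    ratio = len(links) / len(lines) if lines else 0.0
    return {
        "links": len(links),
        "lines": len(lines),
        "piped": piped,
        "links_per_line": ratio,
    }


# Hypothetical dab page: five links on four lines, one of them piped.
page = """'''Mercury''' may refer to:
* [[Mercury (planet)]], the planet nearest the [[Sun]]
* [[Mercury (element)|Mercury]], a [[chemical element]]
* [[Mercury (mythology)]]
"""
stats = link_stats(page)
```

Pages could then be sorted by `links_per_line` or `piped`, highest first, to produce a cleanup worklist; any cut-off value would need to be chosen by looking at real data.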