User:Cpiral/relink.pl


This is listed at Wikipedia:Tools/Editing tools § Relink (starting 17 Dec 2015) after being used dozens of times to clean up redlinks posted at Category:Wikipedia red link cleanup.

Purpose

Purposes:

Usage

Given some wikitext, it can list all the links. This list becomes your links-configuration file, which you edit to remove links. To add links, you type up a list of the links you want to add and make that your links-configuration file. Then you rerun the script against the wikitext to produce the desired linkage for that wikitext.

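See the output of relink -h for usage and instructions. You'll need Perl 5 and its Getopt::Std module, which is a core module bundled with Perl. The script uses the /r substitution modifier, so it needs Perl 5.14 or later.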

Use redirection or piping to specify the input (<) and output (>) files. You name your own input, output, and configuration files.

Use command-line options

  • -l source_filename to list, or to create your links configuration file.
  • -k links_configfile to keep
  • -r links_configfile to remove
  • -a links_configfile to add

to keep, remove, or add the links in your links configuration file.

So to modify the way a file is linked, you can

  • add links from a list you wrote (a links-configuration file).
  • remove links listed in an auto-generated links-configuration file you edited.
  • keep links listed in an auto-generated links-configuration file you edited.
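
The option handling can be sketched with Getopt::Std, using the same option string that appears in the script source. This is only an illustration: parse_options is a hypothetical helper that is not part of relink.pl, and it stores the options in a hash instead of the script's $opt_x globals. In the option string, a letter followed by a colon takes an argument and a bare letter is a flag.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Std;    # core module, bundled with Perl 5

# Hypothetical helper (not in relink.pl): parse an argument list with the
# script's option string and return the options plus any leftover arguments.
sub parse_options {
    my @args = @_;
    local @ARGV = @args;    # getopts works on @ARGV
    my %opt;
    getopts( 'l:u:r:k:a:c:hm', \%opt ) or die "bad options\n";
    return ( \%opt, [@ARGV] );
}

# "-ma promote" clusters the -m flag with -a, which takes the file name.
my ( $opt, $rest ) = parse_options( '-ma', 'promote' );
print "add from: $opt->{a}, every occurrence: ",
    ( $opt->{m} ? 'yes' : 'no' ), "\n";
```

Because -m is a flag and -a takes an argument, -ma promote and -m -a promote are parsed the same way.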

Save the output of relink -l to generate the links configfile. All the links are listed in the order they were found. Then, while viewing both the rendered page and the links configuration file, you use the rendered page to decide where to jump to in the configuration file to remove links. The only editing is the removal of one or more lines. What remains is either what's kept or what's removed from the linkage, depending on whether you then run relink -k or relink -r.

For example, to clean up redlinks, first gauge which is greater, the redlinks or the blue links. If most of the links are blue, delete the redlink lines from the configuration file and use relink -k. If most of the links are red, delete the blue-link lines and use relink -r.
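
The relationship between the two modes can be sketched as a set difference: keeping the listed links is the same as removing every link that is not listed. The helper name complement_of and the sample link names are assumptions made for this illustration. Unlike the script, which takes the difference with a hash slice and so loses the order, this sketch preserves the order in which links were found.

```perl
use strict;
use warnings;

# Hypothetical helper: given all links found in the wikitext and the lines
# left in the configuration file, return the links that are NOT listed.
sub complement_of {
    my ( $all, $listed ) = @_;
    my %listed = map { $_ => 1 } @$listed;
    return grep { !$listed{$_} } @$all;
}

my @all  = ( 'title', 'title|label', 'title3', 'title4|label' );
my @kept = ( 'title3', 'title4|label' );    # lines left after editing

# With -k, these are the links that get unlinked.
my @removed = complement_of( \@all, \@kept );
print "$_\n" for @removed;    # prints "title" then "title|label"
```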

Examples

What is outside the link does not count for uniqueness.

$ cat wikitext
[[link]] [[link|label]] 3[[link]]ed  4[[link|label]]ling

$ relink -c wikitext
2 link
2 link|label
4 total wikilinks

$ relink -l wikitext
link
link|label
2 unique wikilinks
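
The listing and counting shown above can be sketched as follows. The regular expression and the ignore list are taken from the script source; the helper name extract_links and the use of strict and lexical variables are assumptions of this sketch, not the script's own structure.

```perl
use strict;
use warnings;

# Hypothetical helper: return the inside of every [[...]] in the text,
# skipping links that start with an ignored namespace.
sub extract_links {
    my ($text) = @_;
    my $ignore = qr/category|image|file|media/i;
    my @links;
    # (?!...) is a negative look-ahead; (.*?) captures as little as possible.
    while ( $text =~ m/\[\[(?!$ignore)(.*?)\]\]/g ) {
        push @links, $1;
    }
    return @links;
}

my @links =
    extract_links('[[link]] [[link|label]] 3[[link]]ed  4[[link|label]]ling');

# Count occurrences and remember the first-seen order.
my ( %seen, @unique );
for my $link (@links) {
    push @unique, $link unless $seen{$link}++;
}
print "$seen{$_} $_\n" for @unique;            # "2 link" then "2 link|label"
print scalar(@links), " total wikilinks\n";    # "4 total wikilinks"
```

Text outside the brackets, such as the leading 3 or the trailing ed, never reaches the capture group, which is why it does not affect uniqueness.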

Remove or keep

$ cat wikitext
[[title]] [[title|label]]  [[title3]]ed  4[[title4|label]]ling

$ relink -l wikitext > links
4 unique wikilinks

$ cat links
title
title|label
title3
title4|label

Editing the file we called links here, and removing two lines...

$ cat links
title3
title4|label

Here are two opposite uses of the remaining two lines, for the sake of example.

$ relink -r links < wikitext
[[title]] [[title|label]]  title3ed  4labelling
2 links removed

$ relink -k links < wikitext
title label  [[title3]]ed  4[[title4|label]]ling
2 links removed

To save the output, use redirection:

$ relink -r links < wikitext > processed_file

You can use the processed file to act as new wikitext to do more linkage configuration before uploading the final processed_file to the edit box.
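
The removal step can be sketched like this: each listed link is replaced by its label, or by its title when it has no label. The helper name unlink_links is hypothetical, and quotemeta is used here to build the pattern, where the script writes the equivalent \Q...\E inline. As in the script, only the first occurrence of each listed link on a line is replaced.

```perl
use strict;
use warnings;

# Hypothetical helper: unlink each listed link in one line of wikitext and
# return the new line with the number of links removed.
sub unlink_links {
    my ( $line, @remove ) = @_;
    my $count = 0;
    for my $link (@remove) {
        ( my $label = $link ) =~ s/.*\|//;    # text after the pipe, if any
        my $pattern = quotemeta("[[$link]]");  # match the markup literally
        $count++ if $line =~ s/$pattern/$label/;
    }
    return ( $line, $count );
}

my ( $out, $count ) = unlink_links(
    '[[title]] [[title|label]]  [[title3]]ed  4[[title4|label]]ling',
    'title3', 'title4|label',
);
print "$out\n";                   # [[title]] [[title|label]]  title3ed  4labelling
print "$count links removed\n";   # 2 links removed
```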

Add

$ cat wikitext
[[title]] [[title|label]]  label label title title

$ cat promote
  title
  title | label

$ relink -a promote < wikitext
[[title]] [[title|label]]  [[title|label]] label [[title]] title
2 links added.

$ relink -ma promote < wikitext
[[title]] [[title|label]]  [[title|label]] [[title|label]] [[title]] [[title]]
4 links added.
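
The way -a avoids relinking text that is already linked can be sketched with the same negative look-ahead the script uses: a phrase matches only when it is not followed by | or ]]. The helper name link_first is hypothetical, and the \Q...\E quoting of the phrase is an addition in this sketch so that titles containing characters such as parentheses are matched literally.

```perl
use strict;
use warnings;

# Hypothetical helper: link the first unlinked occurrence of a phrase and
# return the new line with the number of links added (0 or 1).
sub link_first {
    my ( $line, $phrase, $markup ) = @_;
    # (?! *(?:\||\]\])) rejects a phrase that is followed, after optional
    # spaces, by "|" or "]]", that is, one already inside link markup.
    my $changed = $line =~ s/\Q$phrase\E(?! *(?:\||\]\]))/$markup/;
    return ( $line, $changed ? 1 : 0 );
}

my $wikitext = '[[title]] [[title|label]]  label label title title';
my ($out) = link_first( $wikitext, 'label', '[[title|label]]' );
($out) = link_first( $out, 'title', '[[title]]' );
print "$out\n";
# [[title]] [[title|label]]  [[title|label]] label [[title]] title
```

Adding the /g modifier to the substitution would link every unlinked occurrence, which corresponds to the -ma behaviour shown above.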

Source

#!/usr/bin/perl
# Cpiral at gmail, User:Cpiral
use Getopt::Std; getopts 'l:u:r:k:a:c:hm';
use English;
$LIST_SEPARATOR = "";
=pod
Development/testing imperatives:
+   output deleted titles for talk page report (else info lost)
+   use strict compliance to lexify global variables
=cut
$ignore = qr/category|image|file|media/i;
BEGIN {
    $USAGE = '
    Process your [[ link "title" | link "label" ]] structures.

    source_file: original wikitext (You must download it.)
    link_configfile: list of labels. You name and create it.
    processed_file: final wikitext (You can reprocess it.)

    To remove links:

        1) relink -l source_file > link_configfile
        The -l option automatically creates a linkage snapshot.
        You can manually create your own instead of this step.

        2) Edit link_configfile. 
        Change the snapshot into a new, wanted configuration.
        You only delete lines.  (See next for which ones.)
        
        3)

            a) relink -r link_configfile < source_file > processed_file
            The -r option removes the labels from their linkage-markup.
            In this case the list of labels are unwanted, e.g. redlinks.

            OR

            b) relink -k link_configfile < source_file > processed_file
            The -k will keep _only_ the list of "keeper" labels.
            The processed_file will have all _other_ links removed.
            (Relink ignores the Category, Image, Media, or File namespace.)
            In this case the list of labels are a new snapshot of linkage.

    Note that processed_file is a source_file, and can be reprocessed.
    You preview by leaving off the output-redirection: > processed_file.

    To add a set of missing links to a list of pages, for each page:
        
        relink -a link_configfile < source_file > processed_file

    Hand create your own link_configfile

    Synopsis of relink:
    relink -l source_file
    relink { -r | -k | -[m]a } link_configfile
    relink -c source_file
    -l outputs the labels of all links in the source_file
    -r removes linkage from all given labels in link_configfile
    -k keeps only links given in the link_configfile, removes others
    -a adds links given in the link_configfile, ignores others
        -ma (multiple adds) links every occurrence
    -c outputs the count of links in the wikitext


    ';
}

if ( $opt_h ) {
    print $USAGE;
    exit;
}

# Input the MediaWiki page source
if ($opt_l or $opt_c){
    $either = $opt_l ? $opt_l : $opt_c; 
    open (SOURCE, "<", $either ) or die "Cannot read $either: $!";
    while ( <SOURCE> ) # wikitext 
    {
        if ( m/\[\[/ ) { # if wikitext may have a link
            # then get all links on that line
            # ?! matches by look-ahead
            # .*? matches ASAP, and (.*?) is captured as $1
            while (m/\[\[(?!$ignore)(.*?)\]\]/g) { 
                push @links, "$1\n"; # entire|insides
            }
        }
    }

    foreach $link (@links) {
        $seen{$link}++;
    } # needs some kind of order

    $count_unique = $count_total = 0;
    foreach $link (@links) {
        if ( $opt_c ) {
            print "$seen{$link} " if $seen{$link};
            $count_total += $seen{$link};
        }
        if ($seen{$link}) {
            print "$link"; 
            $count_unique ++;
            delete $seen{$link};
        }
    }
    #close SOURCE;
    print STDERR "$count_unique unique wikilinks \n" if $opt_l;
    print STDERR "$count_total total wikilinks \n" if $opt_c;
}

if ($opt_a) {
    $count = 0;
    open (LINK_CONFIGFILE, "<", $opt_a ) or die "Cannot read $opt_a: $!";
    @add = <LINK_CONFIGFILE>;
    chomp (@add);
    foreach ( @add ) {
        if ( /[|]/ ) {
            # e.g. wikt:neutralize | neutralize
            ($title,$label) = split /\s*\|\s*/; # configfile ignores spacing
            $label =~ s/\s+$//; # no hidden whitespace
            $title =~ s/^\s+//; # no leading whitespace
            $links{$label} = "[[$title|$label]]";
        } else { # title needs no label
            s/^\s+//; # no leading whitespace
            s/\s+$//; # no trailing whitespace
            $links{$_} = "[[$_]]";
        }
    }
    while ( <> ) # reading the wikitext from standard input
    {
        foreach $phrase ( keys %links ) { # title or title|label

            if ( not $opt_m ) { # feature: link nth occurrence
                # \Q...\E quotes regexp metacharacters in the phrase, e.g. parentheses
                if ( m/\Q$phrase\E(?! *(\||\]\]))/ ) {  # looking ahead, no | or ]]
                    # next regexp says "followed by neither ]] nor |"
                    s/\Q$phrase\E(?! *(\||\]\]))/$links{$phrase}/; 
                    delete $links{$phrase}; # link first occurrence
                    $count++;
                }
            }
            else { # link every occurrence

                if ( m/\Q$phrase\E(?! *(\||\]\]))/ ) { #
                    $count++ while m/\Q$phrase\E(?! *(\||\]\]))/g;  # count matches
                    s/\Q$phrase\E(?! *(\||\]\]))/$links{$phrase}/g; # replace matches
                }
            }
        }
        print;

    }
    print STDERR "$count links added.\n";
}

if ($opt_r) {

    open (LINK_CONFIGFILE, "<", $opt_r ) or die "Cannot read $opt_r: $!";
    @remove = <LINK_CONFIGFILE>;
    chomp @remove;

    $count = 0;
    while ( <> ) {
        if ( m/\[\[/ ) { 
            foreach $link ( @remove ) {
                # autogenerated configuration file line format: title | label 
                $replacement = ($link =~ s/.*\|//r); # replacement is label
                $count++ if s/\Q[[$link]]\E/$replacement/; # replace link
            }
        }
        print STDOUT;
    }
    print STDERR "$count links removed\n";
}


if ($opt_k) {

    @source = <>;

    open (LINK_CONFIGFILE, "<", $opt_k ) or die "Cannot read $opt_k: $!";
    @keep = <LINK_CONFIGFILE>;
    chomp @keep;

    foreach (@source) {
        if ( m/\[\[/ ) { 
            while (m/\[\[(?!$ignore)(.*?)\]\]/g) { # ".*?" matches ASAP
                push @oldlinks, $1; 
            }
        }
    }
    @diff{@oldlinks} = @oldlinks;
    delete @diff{@keep};
    @remove = keys %diff;

    foreach ( @source ) {
        $source = $_;
        if ( m/\[\[/ ) { 
            foreach $link (@remove) {
                # structure: [[ title | label ]]
                $replacement = ($link =~ s/.*\|//r); # replacement is label
                $count++ if 
                    $source =~ s/\Q[[$link]]\E/$replacement/; # replace link
            }
        }
        print STDOUT $source;
    }
    print STDERR $count ? $count : 0, " links removed\n";
}

See also