User talk:WebCiteBOT/Archive

AfD[edit]

Is it doing what you want with "Riders of the Flood"? It has an AfD tag, and the bot apparently archived several pages for it. - Peregrine Fisher (talk) (contribs) 02:47, 18 April 2009 (UTC)[reply]

I'm basing this on User:WebCiteBOT/Logs/2009-04-17.log. Where it says "Attempting to archive http://www.tuckwillergallery.com/...SUCCESS" etc. - Peregrine Fisher (talk) (contribs) 02:48, 18 April 2009 (UTC)[reply]
I hadn't actually implemented the feature to skip AfD articles (forgot). I will do so now. --ThaddeusB (talk) 13:51, 18 April 2009 (UTC)[reply]
Cool. - Peregrine Fisher (talk) (contribs) 15:46, 18 April 2009 (UTC)[reply]

Here there is a bare URL, which the bot then formats with the cite web template. The problem is that the page doesn't have a references section. It also doesn't do anything to the ref in single brackets, which may be intentional, I don't know. - Peregrine Fisher (talk) (contribs) 07:14, 19 April 2009 (UTC)[reply]

I have now added a function to add a reference section if it converts a bare URL and no reference section exists yet. At this time, it doesn't touch any other bare URLs. --ThaddeusB (talk) 14:38, 19 April 2009 (UTC)[reply]
Sounds good. - Peregrine Fisher (talk) (contribs) 16:28, 19 April 2009 (UTC)[reply]

Manually run the bot on a given article[edit]

I would love an option (maybe via the toolserver) to set the bot loose on a given article. So if I want to archive all the references in any one article I'm working on, then I could go to a page and queue it for the bot to work on. Just an idea. — LinguistAtLarge • Talk  17:39, 24 April 2009 (UTC)[reply]

Another useful option would be to go after all WP links to a specific base URL. See Wikipedia:Administrators'_noticeboard#GeoCities_is_shutting_down for a couple of cases in point where this might be useful. LeadSongDog come howl 18:42, 24 April 2009 (UTC)[reply]

Thanks for the suggestions. I will look into expanding the bot's scope after the initial form is fully approved (which hasn't happened yet). --ThaddeusB (talk) 14:22, 25 April 2009 (UTC)[reply]

No hurry, but I've had the same thought. It would be great to archive all the links in an article that just made FA or GA and has had a thorough going through, for instance. - Peregrine Fisher (talk) (contribs) 14:51, 25 April 2009 (UTC)[reply]
@ThaddeusB - Sounds good, thanks. — LinguistAtLarge • Talk  20:08, 29 April 2009 (UTC)[reply]

Any updates on this? – Quadell (talk) 00:16, 8 May 2009 (UTC)[reply]

Still working out the kinks in the main version... almost done with that. {fingers crossed} --ThaddeusB (talk) 21:37, 9 May 2009 (UTC)[reply]
You can use Checklinks. Its main purpose is to fix dead links, but there is a button that says "Archive with cite web". Tim1357 (talk) 23:34, 26 September 2009 (UTC)[reply]

FT article[edit]

WebCiteBOT recently visited 10,000 Women and tried to archive this FT story, but the archived copy doesn't display correctly. Gareth Jones (talk) 23:19, 26 April 2009 (UTC)[reply]

Thank you for alerting me to this error. The problem is on WebCite's end (it archives a piece of JavaScript rather than the file requested). I have just alerted them to the bug and hopefully it will be corrected shortly. In the meantime, I will write a check function to prevent such archives from being recognized as successful by my program. --ThaddeusB (talk) 23:57, 26 April 2009 (UTC)[reply]

Interwiki map[edit]

Is it worth having WebCite URLs listed on the meta:interwiki map? Mapping to http://webcitation.org/$1 should work as far as I can tell. That way instead of linking to http://www.webcitation.org/5gZqf5ajC in refs you could link to something like [[wc:5gZqf5ajC]] (looks like "wc" is free btw). I appreciate this isn't a huge difference, but clearing out extraneous addressing fragments from the mass of wikitext involved in ref formatting makes it easier for humans to parse; plus, as that page on meta shows, we already have huge numbers of obscure linking shortcuts for websites that won't be linked to anywhere near as much as WebCite, so there seems to be consensus that these things can only be of help to editors. I thought I'd seen discussion somewhere where you or someone else said you'd like to build WebCite syntax into the cite templates, although I can't now find it. Implementing this in the interwiki map would avoid having to change the templates at all, which some might feel privileges a particular archiving website over others. The logic seems clear to me! If this isn't done, can I at least suggest removing the unnecessary www from WebCite URLs, on the grounds that any shortening of the ref clutter is helpful, and also that www is deprecated! 79.64.170.147 (talk) 02:07, 7 May 2009 (UTC)[reply]

Hello,
It was me that raised the idea of working WebCite into the cite templates' code, but I quickly dropped the idea when it became clear there was no need.
I certainly wouldn't object to such a listing on the interwiki map. If you want to create a proposal (or whatever you want to call it) to add one, I'd be happy to comment on it. Just leave me a link to the discussion here if you decide to do so. Changing the bot's code would be a trivial matter if/when the change went live. --ThaddeusB (talk) 02:48, 7 May 2009 (UTC)[reply]
This is an excellent idea, in my opinion. If the IP wants to suggest it, I'll lend my support as well. Huntster (t@c) 03:31, 7 May 2009 (UTC)[reply]

Proposed here 79.64.254.219 (talk) 11:38, 7 May 2009 (UTC)[reply]

Thanks 219, I'll add my support. I've realised a slight problem with this, however. Since any archive url, be it WebCitation or the Internet Archive, should be placed in the |archiveurl= field of the various "Cite X" templates, an interwiki link such as [[wc:5gZqf5ajC]] won't function. For example:
  • [[[wikt:test]] "Example"]. Example.com. 2009-05-05. Archived from the original on 2009-05-08. {{cite web}}: Check |archiveurl= value (help)
Any ideas on how this can be fixed or otherwise manipulated to work? Huntster (t@c) 22:22, 7 May 2009 (UTC)[reply]
Yah, that is definitely an issue. I imagine it would take a change in the wiki software to correct, although possibly a fix to the {{Citation/core}} template would do. I have no ideas on possible workarounds either. --ThaddeusB (talk) 21:37, 9 May 2009 (UTC)[reply]
Sure, but this is a good first step. All in good time. Huntster (t@c) 23:43, 9 May 2009 (UTC)[reply]

Many thanks to the creators of this![edit]

Many, many thanks to the creators of this! I always had this gnawing feeling that the contributions I made would be lost or distorted in some way over time. But with this function you have put my mind at ease! Many, many thanks! Boyd Reimer (talk) 13:10, 15 May 2009 (UTC)[reply]

Love-fest pile on! I wish your bot had lips so I could kiss it. – Quadell (talk) 20:28, 15 May 2009 (UTC)[reply]

I applaud this effort as well. -- C. A. Russell (talk) 17:15, 3 June 2009 (UTC)[reply]

Thanks guys, the praise is appreciated. --ThaddeusB (talk) 18:55, 3 June 2009 (UTC)[reply]

WebCiteBOT – will you be my valentine? Pslide (talk) 12:22, 11 July 2009 (UTC)[reply]

Stats?[edit]

Is your bot keeping track of its work? It would be interesting to know how many refs are added in a day and whatnot. - Peregrine Fisher (talk) (contribs) 05:22, 16 May 2009 (UTC)[reply]

I will put adding a feature to track statistics on my to do list. Thanks for the suggestion. --ThaddeusB (talk) 20:22, 18 May 2009 (UTC)[reply]
See User:WebCiteBOT/Stats - more stats will be added soon. --ThaddeusB (talk) 19:19, 27 May 2009 (UTC)[reply]
Nice job. Why hasn't the bot been going lately? - Peregrine Fisher (talk) (contribs) 19:24, 27 May 2009 (UTC)[reply]
Because I was adding and testing a feature to capture all the human-supplied metadata on Wiki pages in order to build a database of publishers and such. All done now, so the bot will be back in force later today. It should be running 24/7 by the weekend. --ThaddeusB (talk) 19:30, 27 May 2009 (UTC)[reply]
Cool! - Peregrine Fisher (talk) (contribs) 20:12, 27 May 2009 (UTC)[reply]

Links that are already archived[edit]

I have in the past manually supplied Internet Archive or WebCite archive links in a reference. How will WebCiteBot handle links that are already cited in the context of an archived version? I do hope it will avoid creating a circular reference (archiving an archived copy). --User:Ceyockey (talk to me) 01:34, 28 May 2009 (UTC)[reply]

Fortunately, the bot is intelligent enough to skip links that point to existing archives. --ThaddeusB (talk) 03:07, 28 May 2009 (UTC)[reply]

Reaction from WebCite[edit]

Do you have a notion of the reaction from the managers of WebCite to suddenly seeing an upswing in archiving activity as a result of this Bot completing its tasks? --User:Ceyockey (talk to me) 01:36, 28 May 2009 (UTC)[reply]

I have been in contact with Gunther Eysenbach throughout the process. They have been very supportive of the project (in fact, they had the idea of creating a bot like this in their FAQ before it was independently thought up here.) --ThaddeusB (talk) 03:07, 28 May 2009 (UTC)[reply]

Please clarify "not all links are caught when approximately 4 or more links are added at once"[edit]

I spend a lot of my time on Wikipedia adding references to articles that are nominated for deletion, and usually do all of my additions in one edit, so I often add "approximately 4 or more links" at once. I'd like something better than "approximately" to work on, so that I know how often I should save. For example, if I save an article after adding three links will that guarantee that they will be seen by this bot? Phil Bridger (talk) 22:26, 28 May 2009 (UTC)[reply]

I'll try to be as precise as possible... IRC has a hard limit on the number of characters allowed per line. The bot that reports to the IRC feed that I rely on puts all the added links on one line, regardless of how long this makes the line. After you take away the other info it reports (person who added the link, diff link, etc.) there are about 300 usable characters for the actual links. ~15-25 of these are taken by the wiki page name. For each URL reported the IRC bot also reports the number of times it has been added to Wikipedia & the number of links the adding user has added. This takes up about 16 characters. So to get an exact measure you have to find the length of the URLs you added and add 16 bytes per URL (except for the last one, as it doesn't matter if that extra info gets cut off). When you get beyond about 275 bytes the URLs will start to get cut off. In practice, you get 4 URLs of normal length, or 3 longer ones, into 275 bytes. --ThaddeusB (talk) 23:57, 28 May 2009 (UTC)[reply]
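To make that budget concrete, here is a rough sketch of the fit check in Python (purely illustrative; the ~300-character, ~16-byte, and ~275-byte figures are the approximations quoted above, not exact values from the feed bot):

  # Rough estimate of whether a batch of newly added URLs fits on one IRC
  # feed line. All figures are approximations taken from the thread above.
  def links_fit(page_name: str, urls: list[str]) -> bool:
      budget = 300 - len(page_name)          # ~300 usable chars, minus the wiki page name
      used = sum(len(u) + 16 for u in urls)  # ~16 bytes of per-URL counters...
      used -= 16                             # ...except the last URL, where cut-off is harmless
      return used <= budget

  links_fit("Some article", ["http://example.com/a", "http://example.com/b"])  # True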
OK, thanks, I think I can work with that. Phil Bridger (talk) 01:00, 29 May 2009 (UTC)[reply]

WebCiteBOT going wrong[edit]

This WebCiteBOT has recently edited the article I am working on, the 2008 French Grand Prix. Several times now, it has added authors to refs where there aren't any, and I don't want them. Archiving pages and all that is fine, but adding authors I do find frustrating, especially as it has put the author for one of them as "glopes" (goodness knows how it came up with that). Is there a way to fix this? Darth Newdar (talk) 07:07, 29 May 2009 (UTC)[reply]

Aargh, it's done it on the 2008 German Grand Prix article now. I really do know what I am talking about when I say I do not want an author for some of these refs. Darth Newdar (talk) 07:14, 29 May 2009 (UTC)[reply]
Very interesting, and not good. Thaddeus, may I strongly suggest that the bot not add Author or any other data of this nature? Metadata elements are simply too prone to error to try and scrape them. Huntster (t@c) 10:04, 29 May 2009 (UTC)[reply]
First of all, thank you for your valuable contributions to Wikipedia. Now, in fairness the bot didn't actually edit the page several times - it edited it twice. It also did not re-add the author info you deleted - it just added author info to a newly added source. The author "glopes" came directly from the PDF file - in all likelihood someone with the last name "Glopes" was responsible for putting together the info in the file. Even if it is incorrect, it is not the bot's fault that the PDF contained an inaccurate author - that information was supplied by a human and just copied by the bot.
The software already has some checks to prevent bad author info from getting copied, but the GrandPrix.com one slipped through, so I will look into improving the code to prevent that from happening again. --ThaddeusB (talk) 18:41, 29 May 2009 (UTC)[reply]
Thanks. Darth Newdar (talk) 19:22, 29 May 2009 (UTC)[reply]
It's added the author info "glopes" on the 2008 Turkish Grand Prix now. Darth Newdar (talk) 11:07, 2 June 2009 (UTC)[reply]

www.webcitation.org not working?[edit]

Hey, what's going on with www.webcitation.org? I haven't been able to manually archive anything for a while now; is anyone else getting constant error messages? Argh. -- œ 20:12, 5 June 2009 (UTC)[reply]

Yes, I got the error message too. Rettetast (talk) 20:23, 5 June 2009 (UTC)[reply]
Yah, it has been down for at least 48 hours straight now. These down times seem fairly common, but this is the longest I've seen. --ThaddeusB (talk) 23:58, 5 June 2009 (UTC)[reply]

WebCiteBOT on nl?[edit]

Hello Thaddeus, I think that this bot is a great tool. I've been running weblinkchecker.py on nl: until now, but your bot is much better. Could you run it also on nl: (I'm quite sure that it will get approved), or alternatively, could I use your script on nl.wikipedia? Please let me know how you prefer to approach other language editions of wikipedia. Kind regards, --Maurits (talk) 08:02, 7 June 2009 (UTC)[reply]

I do plan to port the BOT to other Wikipedias, but the task is a little more complicated than just changing the name of the encyclopedia, as each site has its own conventions for how references are handled (and of course non-English template names, "reference" sections, and such). I am still perfecting the English version, but when I am ready to start porting it I will be sure to contact you for help. Thank you for the offer. --ThaddeusB (talk) 19:00, 7 June 2009 (UTC)[reply]

Thank you for your reaction, I'll be patient :). Some details about the Dutch Wikipedia in advance:

  1. Our IRC is #wikipedia-nl-vandalism.
  2. We don't have a dead-link template; as an alternative I normally use <!-- dode link -->.
  3. Our deletion-templates (or for the consideration thereof) are: Artikelweg, Auteur, Auteur2, WB, Reclame, Wiu, Nuweg, Nuweg-reclame, Transwiki, Weg2, NE, Xauteur, Xreclame, Xwb, XNe, Xweg, Xwiu. Their prefix is Sjabloon:. There are some redirects to these templates too: Weg, Artweg, Wb, Woordenboekdefinitie, Promo, Promotie, WIU, Delete, Speedydelete, Speedy, Db, Reclame-nuweg. (This enumeration excludes those for images, files, categories, et cetera; if my assumption that these are unnecessary is false, please let me know).
  4. Our web reference-templates are 'Cite web' (identical to the english version) and 'Voetnoot web' (almost identical; if you need some translation/interpretation, let me know). Redirects: 'Citeweb', 'Cite Web' (to the former), 'Citeer web' (to the latter).
  5. Links can mainly be found within <ref></ref> syntax and in reference sections with the titles: 'Externe verwijzing', 'Externe verwijzingen', 'Voetnoten', 'Voetnoot', 'Referenties', 'Noten', 'Bronvermelding'. The following templates include a <references />-tag: Reflist, Referenties, Bronnen/noten/referenties, Bron2, Bron3, Ref, References, Appendix, Noot. Redirects: Bron, Refs, Bronnen, Bronnen en referenties, Bronnen/noten/referenties/doc, Note.

If I can be of any more help, please let me know. Kind regards, --Maurits (talk) 22:04, 7 June 2009 (UTC)[reply]

Later in the alphabet?[edit]

Can you estimate when the bot will get to articles later in the alphabet? It seems to be doing ones that start with numbers and punctuation. Thanks. - Peregrine Fisher (talk) (contribs) 07:01, 8 June 2009 (UTC)[reply]

I am pretty sure the bot is ready to go "full steam" now that I finally got all the important features added and tested. Right now, it is just a matter of Webcitation.org staying online long enough for the bot to do its thing. It seems WebCite has been down about 4.5 of the last 5 days. --ThaddeusB (talk) 15:07, 8 June 2009 (UTC)[reply]
Sounds good. You don't think your bot has anything to do with them being down, do you? Wikipedia probably does more refs than academe, although I don't know much about it. - Peregrine Fisher (talk) (contribs) 15:52, 8 June 2009 (UTC)[reply]
I don't think the two are related, but I can't say for sure. Let's hope they are just busy making improvements or something. --ThaddeusB (talk) 16:12, 8 June 2009 (UTC)[reply]

I love you, WebCiteBOT![edit]

eom Mike R (talk) 22:20, 10 June 2009 (UTC)[reply]

Barnstar[edit]

The da Vinci Barnstar
Excellent bot, thank you. Jezhotwells (talk) 23:05, 10 June 2009 (UTC)[reply]

Should the 'bot use the "long form" of a webcitation URL?[edit]

The bot currently uses the "short form" of webcitation URLs:

"http://www.webcitation.org/5hmrynvya"

rather than the long form, which encodes the source of the item:

"http://www.webcitation.org/query?url=http%3A%2F%2Fwww.forbes.com%2Ffeeds%2Fap%2F2009%2F06%2F24%2Fap6581515.html&date=2009-06-25"

I'd encourage using the long form of the URL, so that, if at some future time, Webcitation goes down, we can find the original URL or an archive.org URL from the webcitation URL. --John Nagle (talk) 05:14, 25 June 2009 (UTC)[reply]
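For reference, the long form is just the original URL percent-encoded into a query string, so either form can be derived from the source URL plus the archive date. A minimal sketch of constructing it (assuming Python's standard urllib; illustrative only, not the bot's code):

  import urllib.parse

  def long_form(original_url: str, date: str) -> str:
      # Build the long-form webcitation.org query URL from the source URL and date.
      query = urllib.parse.urlencode({"url": original_url, "date": date})
      return "http://www.webcitation.org/query?" + query

  # long_form("http://www.forbes.com/feeds/ap/2009/06/24/ap6581515.html", "2009-06-25")
  # yields the long-form URL quoted above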

The original URL is always left in any article it edits. A database is also kept locally. Therefore, there is no need to spend all the extra characters "congesting" an article. --ThaddeusB (talk) 05:34, 25 June 2009 (UTC)[reply]
Ah, right, you're only doing template citations now, right? For non-template citations, the long form might be appropriate. --John Nagle (talk) 06:10, 25 June 2009 (UTC)[reply]
About 80-90% are templates. For the rest, it does something like this. Either way the original URL always remains in the article. --ThaddeusB (talk) 07:26, 25 June 2009 (UTC)[reply]
Indeed, the original URL should never be removed in favour of only the archived URL. We may get annoyed by the extra space taken up in the edit window by such additional code, but in the long run, it serves our and our readers' best interests. By the way, any word on current WebCitation stability? Still kind of iffy from what I've seen. Huntster (t@c) 10:28, 25 June 2009 (UTC)[reply]
It seems to be getting better, but they are still having some problems. --ThaddeusB (talk) 15:17, 25 June 2009 (UTC)[reply]
Of course right after I say this it goes down for 48+ hours straight. The fact that it is returning a 403 error message instead of the usual "internal error" is encouraging though. Hopefully it means they are literally in the process of changing over the servers. --ThaddeusB (talk) 22:11, 26 June 2009 (UTC)[reply]
Yeah, it now specifically mentions that the site is down for maintenance. Huntster (t@c) 01:52, 28 June 2009 (UTC)[reply]
One advantage of the long Webcitation URL form is that one could try to find the page in the Internet Archive using that data. The Archive runs about six months behind real time, but there are multiple copies of the Archive in different locations (one in San Francisco, one in Alexandria, Egypt), so if Webcitation goes down, there's often a backup that a bot could find automatically. We need to think long term here. --John Nagle (talk) 03:19, 28 June 2009 (UTC)[reply]
Did you not read this earlier reply? Anomie 05:02, 28 June 2009 (UTC)[reply]

The Wayback Machine[edit]

Hello ThaddeusB. May I first of all congratulate you on this great bot! I hope that the combination of the WebCite service, your robot, and positive commitment from all people involved will form a definitive solution against link rot on Wikipedia, a problem that has been concerning me ever since I began editing. Imagine, it could completely prevent things like this from being a problem for us.

I noted that neither this talk page nor the bot's approval document discusses a possible role for the Wayback Machine. The WM is extremely slow with archiving content: it usually takes many months before a site becomes available after submission. However, having used the WM quite a lot, I know from experience that it has a lot of copies, and usually multiple copies of one page. Perhaps WebCiteBOT could check whether the WM already has one or more copies of a page, so that it doesn't have to submit it to WebCite?

Furthermore, in a topic on the Wayback Machine forum, a user named "Dr Gunther Eysenbach, WebCite" wrote on March 3 this year: "It is WebCite's aim to create a distributed storage & retrieval infrastructure, which would involve depositing "webcited" material in IA [the Internet Archive]. Currently this is not yet implemented, mainly due to lack of manpower to implement this. However, we will pursue this with increased priority." WebCite's main page also says: "Current digital preservation partners include the Internet Archive as well as several libraries, through which WebCite® archived material may be available." In other words, stuff at WebCite will eventually show up at the Wayback Machine too.

I don't know if you already knew/considered this, but I thought I'd just put it up here. Again, many thanks for creating this, and good luck! Cheers, theFace 14:23, 27 July 2009 (UTC)[reply]

Pslide sent me a message about the above post. I gave this reply, which might be of further interest. Cheers, theFace 20:35, 27 July 2009 (UTC)[reply]
Thank you for the information and suggestions. I was unaware of WebCite's desire to interface with archive.org. Not surprisingly, I was aware of the Wayback Machine, though, and have personally used it on several occasions. :) On my long list of things I'd like to do someday is create an intelligent version of DeadLinkBOT that automatically finds versions of dead links on archive.org and saves our links by replacing them with an archived version. Of course my desire to program something doesn't necessarily equate with the time to do it. :)
In regards to a few things you said on your talk page:
  • "Currently the number of links to WebCite on Wikipedia is still scarce."
    • We're over 22k links to WebCite now, which is quite small compared to the possibilities but a significant number compared to most sources
  • "the 'bot seems to operate in spurts. Its current pace is clearly not enough to keep up with the stream of links, so I assume the backlog is still growing. Has this also something to do with WebCite's downtime?"
    • As you can see from User:WebCiteBOT/Stats between 7000 and 10000 links are added to the queue on a typical day. Many of these are duplicates, quickly removed, or not used as sources, but still we are talking about a couple thousand new links to archive a day. I am currently purposely limiting the bot's activity as I am mindful of over-extending WebCite's resources again. I am ramping it up a little each day though, and soon we should at least reach the equilibrium level. This is the main reason for the "spurtiness", but it is also due partially to the way it is designed. By design, the bot submits a batch of links for archive, waits an hour, and then tests the archives and updates the Wikipedia pages.
    • Based on the "1 request per 5 seconds" rule WebCite requested, it could send up to 17K links a day to webcitation.org so there is theoretically plenty of time available for the bot to start catching up. Whether they can actually handle it or not remains to be seen though.
  • "perhaps WebCiteBOT could be programmed to prioritize"
    • This is a reasonable suggestion on paper, but it isn't worth the trouble. Even the most popular sources account for less than .5% of the total links, so the savings wouldn't be worth the human effort to determine stability.
So in summary, I will certainly take your post into consideration and welcome more suggestions. However, I don't plan to make any immediate changes. Feel free to ask if you have any questions. --ThaddeusB (talk) 05:51, 28 July 2009 (UTC)[reply]
Thanks for the reply. I agree with your opinions, and I like your idea about a DeadLinkBOT replacing rotten links with fresh waybacked links. Now that you mention it, it would indeed be kinda weird to have a bot named WebCiteBOT adding links to the Wayback Machine. But I suggested that because I thought it would be much more efficient: the content of the majority of the pages which Wikipedians cite (informative articles, news reports, interviews, reviews, tweets, etc.) will likely never change. So if there is already a copy of it at the WM, then why make another?
But anyway, that decision is yours, of course. I understand your statement that you lack the time to do what you want. That's something I am also familiar with ;-). Cheers, theFace 19:52, 28 July 2009 (UTC)[reply]

Running?[edit]

Is the bot running? When I find a dead link, how can I find the archived version? --Apoc2400 (talk) 11:29, 28 August 2009 (UTC)[reply]

It's been down the past few days due to connectivity issues on my end. Should be back running tomorrow I think.
To manually fix a dead link, you can try doing a Google search to see where it might have moved to, doing an Archive.org search, or doing a search at Webcitation.org --ThaddeusB (talk) 05:19, 29 August 2009 (UTC)[reply]
So if I specifically want the copy this bot archived, I should search at Webcitation.org? I tried with some links I have added as references over the past months, but not a single one works. I always get "We do not have any snapshots of the given URL [...] in our database". Does it automatically replace dead links with the archived version? --Apoc2400 (talk) 14:15, 29 August 2009 (UTC)[reply]
That just means the bot hasn't archived the page yet. It will automatically add the archived link to the article when it does. Or, you can manually archive using http://www.webcitation.org/archive if you want it to happen sooner. --ThaddeusB (talk) 14:45, 29 August 2009 (UTC)[reply]
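(WebCite also documented a simple HTTP interface for scripted archiving. The sketch below assumes that interface accepts url and email query parameters on the /archive endpoint; treat the exact parameter names as an assumption rather than a tested recipe.)

  import urllib.parse
  import urllib.request

  def request_webcite_archive(url: str, email: str) -> str:
      # Submit one archiving request to WebCite and return the raw response body.
      # Endpoint and parameter names are assumptions, per the note above.
      query = urllib.parse.urlencode({"url": url, "email": email})
      with urllib.request.urlopen("http://www.webcitation.org/archive?" + query) as resp:
          return resp.read().decode("utf-8", errors="replace")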
It seems like the only real roadblock in the way of this bot is the huge backlog. Is that an issue that can be solved by adding more computational muscle? Perhaps moving the bot to the toolserver would speed up the process. Then again, the toolserver is buggy, so it might do more harm than good. Tim1357 (talk) 02:29, 16 September 2009 (UTC)[reply]

Something you should see[edit]

www.archive.org does a lot of what you are trying to do with your bot. It has crawlers constantly taking snapshots of the internet. Maybe you can alter your bot so that when it finds dead links in Wikipedia, it looks them up on this website; if it finds a copy, it could then update the dead link in the Wikipedia mainspace.

ALSO

I would like to see your source code if I could.

thanks

Tim1357 (talk) 02:46, 29 August 2009 (UTC)[reply]

While Archive.org is okay, their archiving program is somewhat inferior when it comes to replicating the original look of a website, and note that even after they archive, it is six months (or longer) before the archived copy becomes available for public viewing. With WebCitation, the look of the site is normally very well preserved, the success of the archive effort is immediately known, and the copy is immediately available for use. Not to mention, the URL is much shorter than Archive.org's :) This isn't to say one should be used exclusively over the other...conversely, both have their use. Huntster (t@c) 03:30, 29 August 2009 (UTC)[reply]
It can't replace what this bot is trying to do, but another bot that searched archive.org and added links where the normal URL has gone bad would be awesome. I don't think this bot is in any way ready to do that kind of thing. - Peregrine Fisher (talk) (contribs) 04:05, 29 August 2009 (UTC)[reply]
I have considered doing a bot similar to what you describe, but it is considerably more complicated than you may think. See my comments at the WP:BOTREQ thread.
A somewhat out of date version of WebCiteBOT can be found here. Please note that while I published the code, I have retained all legal rights to it & not released it for reuse. The Bot's BRFA may also prove useful to you. --ThaddeusB (talk) 05:26, 29 August 2009 (UTC)[reply]
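For anyone exploring the archive.org idea above, the lookup step of such a bot might look roughly like this (a sketch against the Wayback Machine's public availability endpoint; the endpoint and JSON field names are assumptions based on archive.org's current API, not code from WebCiteBOT):

  import json
  import urllib.parse
  import urllib.request

  def closest_snapshot(url: str, timestamp: str = "20090101"):
      # Ask the Wayback Machine for the snapshot closest to `timestamp` (YYYYMMDD);
      # returns the snapshot URL, or None if nothing is archived.
      query = urllib.parse.urlencode({"url": url, "timestamp": timestamp})
      with urllib.request.urlopen("https://archive.org/wayback/available?" + query) as resp:
          data = json.load(resp)
      closest = data.get("archived_snapshots", {}).get("closest")
      return closest["url"] if closest and closest.get("available") else None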

1.2 million links, and still hasn't gotten to the A's?[edit]

That doesn't seem right. I was looking at your total on the stats page. - Peregrine Fisher (talk) (contribs) 16:40, 3 September 2009 (UTC)[reply]

Well the main problem is that it hasn't been running for the past week and half or so due to connectivity problems on my end. I did, however, finally get it restarted this morning. As of right now, there are 957K pages in the unprocessed db, of which 15k start with a number or symbol. --ThaddeusB (talk) 17:21, 3 September 2009 (UTC)[reply]
I'm not bothered that it's still in the symbols. I know it's hard to get past the backlog. It's just that I think the average number of refs per article is probably 2 plus or minus 2. So maybe 3-9 million total refs in all WP. I could be way off with this. Anyways, symbol pages are probably less than 1/30th of all pages, which would mean your bot is about to archive maybe 30 million refs. Something about my math is wrong, or about the way your logs are recording stuff. I think the problem may be in your counting. When I look at a page like User:WebCiteBOT/Logs/2009-07-24.log, it doesn't look like it is 4-10,000 links long. - Peregrine Fisher (talk) (contribs) 20:51, 3 September 2009 (UTC)[reply]
The counting is definitely "off" if we are talking about the amount of links that get added and stay. Often the same reference will be counted multiple times because it is "added" more than once. For example, if it is added and then moved that is 2 counts. Or if it is added and then updated with a new link when info changes, or the page is blanked and reverted. Etc. --ThaddeusB (talk) 01:33, 4 September 2009 (UTC)[reply]

Tennis Articles[edit]

I am in need of this bot to go and archive the slam pages for 2009, namely the 2009 Australian Open, 2009 French Open, 2009 Wimbledon Championships, and the 2009 US Open (tennis). I need this to be done by next Monday, the 14th of September 2009. Get back to me on this, please! 98.240.44.215 (talk) 02:53, 10 September 2009 (UTC)[reply]

It appears you were working on these articles today (the 9th). The bot waits a minimum of 48 hours before archiving new links, so these should be processed on either the 11th or 12th. --ThaddeusB (talk) 03:16, 10 September 2009 (UTC)[reply]
Thank you for answering my question! 98.240.44.215 (talk) 17:37, 11 September 2009 (UTC)[reply]

On demand[edit]

Any more thought on allowing people to have the bot hit certain articles? I'd love to run it on my featured articles, and I'm sure others would too. - Peregrine Fisher (talk) (contribs) 01:21, 11 September 2009 (UTC)[reply]

Sounds like a good idea to set up some sort of on-demand request system, although I don't know why the urgency to get something archived unless you know for sure that the link is only online for a very brief period. -- œ 00:53, 12 September 2009 (UTC)[reply]
Yes, it is still planned, but no it isn't imminent yet. I've just been far too busy to get to the task. --ThaddeusB (talk) 01:47, 12 September 2009 (UTC)[reply]
Also Checklinks has a function to do this.

Can a feature be added to "Cite URL" to disable the action of WebCiteBot?[edit]

This question is because the WebCiteBOT made an inappropriate alteration to a web citation in "2009 flu pandemic in the United Kingdom". I am making a weekly update to a date in a web citation in an image title after I update the image in "Wikimedia Commons" (and as the data updates on the cited web page). It's lucky that I noticed the change made by WebCiteBOT, or the citation would start pointing to an out-of-date page. So can some way be found to disable the action of WebCiteBOT on certain web citations? A "NoWebCiteBOT" parameter on the "cite web" template or something?--Farry (talk) 17:36, 12 September 2009 (UTC)[reply]

In general, it is preferable to have the archive point to the version of the link that the person using the reference saw. I.e., it is a good thing the link shows "outdated info". However, I certainly can see that in a situation like this it might be beneficial to temporarily not have an archive. If you replace

archiveurl=http://www.webcitation.org/...

with

archiveurl=<!--NONE-->

that should prevent the bot from re-adding the info. (Make sure to delete archivedate=... as well since it wouldn't make sense to have a date for a non-existent archive.) If that doesn't work, let me know. --ThaddeusB (talk) 03:48, 13 September 2009 (UTC)[reply]

Works fine. Thanks. --Farry (talk) 18:32, 14 September 2009 (UTC)[reply]

FlickreviewR bot on Commons[edit]

I would like to suggest a new use for the WebCiteBOT: use it with the commons:User:FlickreviewR bot to archive the flickr pages. That way, if the image is later deleted or its license changed, there won't be any question about whether it was actually available under a free license at the time of the upload. It would be very helpful in substantiating claims under images tagged with this: commons:Template:Flickr-change-of-license--Blargh29 (talk) 01:45, 22 September 2009 (UTC)[reply]

If I ever get the bot working on a consistent basis, I will definitely expand it to do that. --ThaddeusB (talk) 05:39, 27 September 2009 (UTC)[reply]

Backlog[edit]

The bot seems to still be in the "2"s. Will it ever get to later in the alphabet? It seems to have been in the "2"s for months now. - Peregrine Fisher (talk) (contribs) 05:19, 27 September 2009 (UTC)[reply]

Yah, I've been having a lot of problems. Between the API timing out on me and WebCite inexplicably reporting entire sequences of pages as 404 when they aren't, I've had to make the bot redo the same pages over & over again. That said, it should fly through the rest of the #s after it gets past "2009...", which it is almost past now. --ThaddeusB (talk) 05:43, 27 September 2009 (UTC)[reply]
Sounds good. I look forward to the "3"s and beyond. - Peregrine Fisher (talk) (contribs) 05:51, 27 September 2009 (UTC)[reply]

Is there any way to see the backlog? See the list of articles that still need to be processed?—NMajdantalk 16:15, 4 October 2009 (UTC)[reply]

Not directly, but I can query it locally. As of right now there are 1.1M pages in the backlog of which a large number are duplicates - probably around 300k unique pages. If you want a more precise count, let me know and I'll generate one. --ThaddeusB (talk) 16:48, 4 October 2009 (UTC)[reply]
No, I was more or less curious to see the order in which the bot may be tackling these articles. Maybe when the backlog gets down to a manageable size, you can have the bot update a page once a day or so with the backlog. Frankly, I was just curious if citations I have been making were going to be included in a future archival, so my question wasn't entirely altruistic.—NMajdantalk 21:53, 5 October 2009 (UTC)[reply]
Another question, at what rate are you adding to the backlog? Just trying to figure out how long before the bot is caught up. Looks like the average rate in September was about 6500/day. At 300,000 articles in the backlog now, it would take about 45 days to get caught up assuming no more articles are added. Obviously, I know this bot provides a tremendously valuable service to Wikipedia so I'm a bit more inquisitive than normal.—NMajdantalk 21:58, 5 October 2009 (UTC)[reply]
Just following up. Still very curious about these aspects.—NMajdantalk 14:28, 15 October 2009 (UTC)[reply]
Not sure why I never replied the first time, just slipped my mind I guess... The bot currently functions by first sorting everything into alphabetical order so that it can easily combine duplicate entries into one. Thus, pages nearer the start of the alphabet will be processed sooner regardless of when exactly the links were added (this only happens because of the backlog problem).
The backlog is still growing (although naturally the rate of growth slows as a larger percentage of new entries become duplicative of old entries). I'm not sure if there has been a single day where the program to monitor additions was running and the backlog didn't grow. I don't have precise numbers, though. I finally wrote an effective workaround for the API problems - hopefully that will get it to where at least it is up to break even.
Currently the bot isn't running at all because I am making some modifications so it can rapidly archive all GeoCities links before they go dead later this month. I think it will probably be back up later tonight. Failing that, tomorrow for sure. --ThaddeusB (talk) 01:02, 17 October 2009 (UTC)[reply]
P.S. Feel free to ask as many questions as you like - I don't mind answering them at all. --ThaddeusB (talk) 01:02, 17 October 2009 (UTC)[reply]
That open invitation for questions may not have been a good one :). I was wondering if the bot requests that pages be archived all the time, so that even when it is updating the article, it is still sending requests to WebCite. If it isn't, then it seems that would be a more efficient use of time. Tim1357 (talk) 20:25, 23 October 2009 (UTC)[reply]
The way it currently works is:
  1. pulls the first N links waiting to be archived
  2. pulls up the associated Wikipedia page for the first/next link and checks to see if the link is still there & used as a reference
  3. if the link is a reference, it makes sure the webpage is valid & pulls some metadata from it
  4. if the page is valid, it sends an archive request
  5. return to step 2, until all N links are processed
  6. waits an hour (per request by WebCite people)
  7. goes through each link & checks the status of the archive
  8. if the archive was valid, it updates the Wikipedia page (some links can't be archived due to robots.txt or no_cache settings)
It works like this because archiving isn't always instantaneous (in recent history it has been, but historically it hasn't). Again, the backlog problem isn't due to time constraints, but rather getting the code up to a level where it is stable enough to run 24/7. The current setup should be able to process ~10K unique links a day once it is fully stable, and the "true" rate of unique new links being added a day is most likely under 1K. --ThaddeusB (talk) 04:08, 26 October 2009 (UTC)[reply]
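A skeletal rendering of that cycle, to make the steps concrete (every function here is a hypothetical stand-in for the bot's internals, not its actual code):

  import time

  def pull_links(n): return []            # step 1: next N queued links
  def still_cited(link): return True      # step 2: link still present and used as a reference?
  def fetch_metadata(link): return True   # step 3: page valid? pull title/author/date
  def request_archive(link): pass         # step 4: submit the archive request
  def archive_ok(link): return True       # step 7: did the archive come back valid?
  def update_article(link): pass          # step 8: add archiveurl/archivedate on-wiki

  def run_cycle(n=50):
      batch = [l for l in pull_links(n) if still_cited(l) and fetch_metadata(l)]  # steps 1-3
      for link in batch:
          request_archive(link)           # steps 4-5
      time.sleep(3600)                    # step 6: wait an hour, per WebCite's request
      for link in batch:
          if archive_ok(link):            # robots.txt / no-cache failures are skipped
              update_article(link)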

uic.com.au[edit]

From WP:VPM#The web site "uic.com.au" will be closed soon: some 70 articles are affected, and I was advised to make a request here. Can this bot correct those citations? --Quest for Truth (talk) 18:21, 28 September 2009 (UTC)[reply]

Yes, I will make sure it archives the links before they disappear in December. --ThaddeusB (talk) 00:05, 29 September 2009 (UTC)[reply]

Archiving links on this page[edit]

Would it be possible to have WebCiteBOT periodically archive the links on this page, to prevent any archive.org link rot or removal? If it isn't, sorry for bothering you; if it is, it would be great if you could do that. Thanks for your time. JimmyBlackwing (talk) 05:36, 30 September 2009 (UTC)[reply]

Request[edit]

Hi, please archive the references (1 and 2) on Detroit Lions Television Network. They seem to have died. Thanks. TomCat4680 (talk) 21:52, 30 September 2009 (UTC)[reply]

Also do so for ref 18 on Detroit Lions. TomCat4680 (talk) 21:56, 30 September 2009 (UTC)[reply]

If a link has already died, it cannot be archived (as in, there's nothing there to archive). You'll have to try finding it at http://web.archive.org. Huntster (t @ c) 22:15, 30 September 2009 (UTC)[reply]
Also Checklinks is a useful tool. Tim1357 (talk) 02:02, 1 October 2009 (UTC)[reply]

Concerns about webcitation.org[edit]

I was at the .pst article and wanted to see the source reference. Normally I hover the mouse over a link before I jump and was surprised to see it was pointing to something which conceals the ultimate destination much like tinyURL does.

The problems I'm seeing with links to www.webcitation.org are:

  1. When looking at links I regularly hover the mouse over the link and look at the status bar to see what site is being linked. Usually that's enough for me and I don't click. Converting the links to use the Webcitation.org web site breaks this feature.
  2. I regularly use Special:LinkSearch to see if a particular web site is in "good standing" as far as being a source of references. Having many, and presumably at some point, all, of the article reference links converted to use webcitation.org destroys the usability of Special:LinkSearch.
  3. The WP Foundation takes WP user privacy seriously. There are many policy hurdles to get the IP address of an editor via CheckUser. You'd need to subpoena the foundation to get the IP address of a reader or to have them tell you which pages an IP has visited. Webcitation.org is able to collect five elements for every person who follows a link off Wikipedia: 1) the person's IP address, 2) the Wikipedia page the person was on (referrer tag), 3) the web page the user is interested in, 4) the date/time the request was made, and 5) other information that web browsers send to web sites such as the operating system, browser used, etc.
  4. Related to Wikipedia user privacy is that webcitation.org is tracking users via cookies meaning that they will be able to tie my use of links from Wikipedia to the use of links from other web sites.
  5. Webcitation.org's privacy policy is not reassuring with "From time to time, we may use customer information for new, unanticipated uses not previously disclosed in our privacy notice." While Canada's privacy rules are better than what's in the USA, I see that the webcitation.org servers are physically in Texas, meaning USA rules apply.
  6. Wikipedia has a spam blacklist intended to prevent links to certain sites from getting added to Wikipedia pages. Webcitation.org allows editors to circumvent the blacklist because the link to webcitation.org looks like http://www.webcitation.org/5k40hOrFo where "5k40hOrFo" is a random value that's only meaningful to the webcitation.org web servers.
  7. Wikipedia editors frequently evaluate the potential of links by inspection alone. If I see a link to someuser.blogger.com then I know the odds are low it'll be a reliable source. Webcitation.org breaks evaluation by inspection and instead forces editors to click through before they can judge the source.
  8. Webcitation.org is a private web site that has full control over the outer frame. At present this frame is a plain blue bar. Their policy contains no prohibition against inserting advertising in the frame or even modifying the content they are showing in the lower frame.
  9. People who are not familiar with or aware of webcitation.org will believe the content they are viewing comes from webcitation.org. When I was on the .PST article and saw the link to www.webcitation.org/5k40hOrFo I assumed this would work like tinyurl and redirect me. I clicked and was thinking "Why would a site called www.webcitation.org have a page about file extensions?" I was then thinking this was a bootleg site where someone had stolen a www.FILExt.com page and uploaded it to webcitation.org.

I would like the WebCiteBOT modified so that it adds articles to webcitation.org as it does now but so that any links in the article go to the original web site. The webcitation.org link can be maintained as a comment visible to Wikipedia editors. Should the original site fail then an editor would see the commented-out archive link and could start using that pending finding an appropriate substitute for the original site. There's no reason at all for live links from Wikipedia to this site if the original site is available. Webcitation.org would still serve its (very useful) function of archiving content so that links to citations will not rot completely should the source site be updated or removed. Webcitation.org is also useful in that it can store older versions of the page.

The existing webcitation.org links should also be commented out and links to the source web site restored.

I have also asked that adding links to webcitation.org be blocked at MediaWiki talk:Spam-blacklist#webcitation.org. --Marc Kupper|talk 04:33, 13 October 2009 (UTC)[reply]

  • Webcitation.org and the Internet Archive are the best tools we have to combat linkrot, which I believe to be one of the biggest threats to the content of Wikipedia. Please let me address some of your concerns. First, Webcitation.org is not like TinyURL, but rather it is an archive of web content, which is clearly shown by the blue frame in every Webcitation.org archive. It's used in many academic journals. Second, I really doubt that any spammers, whose business relies on the quick and robotic insertion of their links, would use webcitation.org, which requires a 2-step process to create an archive and requires an email address.
    But, here's a workable solution to some of your other concerns about the hover-over url identification. Instead of asking to comment out the Webcitation.org archived pages, which would totally defeat their purpose, I suggest that the Wikipedia:Citation templates be altered to show the original url as the main link and the archived link as the secondary url. This would be an easy fix at the template level and would assuage any of your concerns, while preserving the work of the many people, including myself and ThaddeusB, who have taken the time to preserve Wikipedia's web sources for the future. Perhaps the discussion can take place at Wikipedia:Citation templates. --Blargh29 (talk) 05:15, 13 October 2009 (UTC)[reply]
    • The counter to that suggestion is that when the original location is known to be dead, we most likely do want the "main" link to go to the archive copy. WebCiteBOT already adds a "deadurl=no" parameter to citation templates when adding the archiveurl parameter, so the change could very easily be done only to citation templates where that parameter is specified and the current behavior retained when "deadurl=yes" or deadurl is unspecified.
      As for the original complaint, it shows a fundamental misunderstanding of the situation. Point 2 is bogus since the original link is always retained by the bot and should be by any human editor. Point 3 applies to every external link to any site (do you want to ban all external links?). Point 4 also applies to pretty much every external link as well, and I note that webcitation.org's cookie is just the default PHPSESSID used for PHP session handling and will be removed when the browser is closed (which is far better than you'll get on many sites commonly used in external links). Point 5 seems needlessly reactionary, as the quoted section of their privacy policy is geared towards people who actually create an account there rather than visitors; regarding visitors, they basically collect what would already be in the web server log anyway (Wikipedia does that too! Oh noes!). Point 6 is easily enough refuted considering that point 2 is bogus: any webcitation.org link without a corresponding original link can and should be subject to scrutiny. Point 7 is basically a duplicate of point 1 and depends on the bogus point 2 to really make any sense. Point 8 is again rather reactionary, as many sites used in external links (for example, pretty much every "mass media" news site) already have advertisements. And as for point 9, there's no accounting for people jumping to conclusions instead of taking the time to click the "What's this?" link in the blue header. Anomie 11:49, 13 October 2009 (UTC)[reply]
      • I pretty much agree with the above responses... The use of the archived link as primary is done by the template and not the bot, so that really is beyond my control. The correct place to ask for the original to be primary is the template talk page. Past discussion about which should be primary has been fairly evenly split, so no change has been made thus far. In any case the fact that the link is archived is clearly indicated in the reference itself: "Archived from the original on 2009-01-01." If you are interested in seeing the original without going through webcitation.org, all you have to do is click (or hover over) that link instead. The rest of the complaints pretty much are true of every external link (and indeed some external links are far worse, containing, for example, malicious script exploits). --ThaddeusB (talk) 14:52, 13 October 2009 (UTC)[reply]
I'd like to answer some of this today but was without power for a good part of the day plus running around dealing with all of the loose ends a season's first major uncovers. --Marc Kupper|talk 08:44, 14 October 2009 (UTC)[reply]

WebCiteBot on other wikis[edit]

Hi,

do you run WebCiteBot on other wikis? Or is the code public? I'm looking for a way to preserve Geocities references on hu.wikipedia, and Webcite seems like the obvious choice. --Tgr (talk) 19:13, 13 October 2009 (UTC)[reply]

I do have plans to eventually expand it beyond enwiki, but so far it isn't yet stable here so I haven't pursued the matter. I haven't released the code under GFDL/CC-BY-SA, but even if I did it wouldn't do you much good as it would need to be modified to run on a foreign wiki. Plus each wiki has its own bot policies, so I'd have to investigate that and possibly get approval for each one - in short, it is a time-consuming matter which I haven't pursued yet.
Now, in regards to GeoCities links, you raise a very good point. I won't have a bot up and running to add the archived links to xx.wikipedia before the end of the month, but what I can do is have it go ahead and do the actual archiving for links found at hu. (and elsewhere) so that an archived version of the link will at least exist, which can then be manually updated or bot-updated at a later date. --ThaddeusB (talk) 15:46, 14 October 2009 (UTC)[reply]

Thanks, I would appreciate if you could do that. (And I would definitely support if you wanted to run WebCiteBot as a global bot, though I suppose compatibility with the various cite templates would be tricky.) --Tgr (talk) 20:14, 15 October 2009 (UTC)[reply]

We should keep an eye on this as well: Wikipedia:WikiProject Spam/LinkReports/webcitation.org--Blargh29 (talk) 13:59, 14 October 2009 (UTC)[reply]

I don't put much stock in those pages since they don't seem to care whether they list various well-known bots (e.g. AnomieBOT gets on their lists fairly often when it fixes orphaned references with a link that their bot doesn't like, and nothing ever seems to come of it). I suppose WikiProject Spam finds some use in them, but what that might be I don't know. Anomie 16:35, 14 October 2009 (UTC)[reply]

Expanded linkrot policy[edit]

I think that Wikipedia needs a stronger linkrot policy. A page that explains 1) what linkrot is, 2) why linkrot is a problem, 3) what can be done to prevent it (aka WebCite), 4) what can be done to repair it (aka the Internet Archive), and 5) how to mitigate unfixable rotted links. Some of this is covered by WP:DEADREF, but that information needs to be beefed up.

But, the first step is to rename Wikipedia:Dead external links to Wikipedia:Dead links, so that the policy is clear that it applies to ALL links, including inline citations, and not just those in the "External links" sections. Please make your comments at Wikipedia talk:Dead external links#New name for this page?.--Blargh29 (talk) 03:26, 20 October 2009 (UTC)[reply]

Wikipedia encourages people to be BOLD, so if you think the current policies need better explanation go ahead and modify them. If you think we need a new set of instructions, go ahead and draft one (in user space if you like) and I'll be happy to look it over.
An overhaul of WP:Dead external links is on my agenda (it is horribly out of date and poorly organized), but I have no idea when I'll get to it. --ThaddeusB (talk) 12:56, 20 October 2009 (UTC)[reply]

Other wikis[edit]

Hi, what about running this bot on other wikis as well? --Nemo 12:13, 27 October 2009 (UTC)[reply]

Ah, there's already #WebCiteBot_on_other_wikis. Well, if you want, I can translate templates, check policies, make requests etc. for you on it.wiki. --Nemo 12:15, 27 October 2009 (UTC)[reply]
Great, thanks for the offer. I'll get back to you when I'm ready to port the bot. --ThaddeusB (talk) 14:50, 27 October 2009 (UTC)[reply]

Bot Status[edit]

No contribs since 11/1. Is the bot down/broken? Or is there just a delay in getting the bot switched back from Encarta/GeoCities archiving to general archiving (we all can understand that real life can get in the way). Just curious.—NMajdantalk 19:13, 9 November 2009 (UTC)[reply]

The lack of editing was just due to a real-life time crunch. --ThaddeusB (talk) 02:09, 10 November 2009 (UTC)[reply]
When will the bot be operational again?—NMajdantalk 20:25, 18 November 2009 (UTC)[reply]

Looks like the bot made about 200 edits after it was restarted on 11/21 but has again been down for a week. When will the bot be fully operational again?—NMajdantalk 16:14, 2 December 2009 (UTC)[reply]

After this week, I will be on real world vacation and have a lot more time for Wikipedia. Thus, you can expect it running full time by next week at the latest. --ThaddeusB (talk) 00:02, 3 December 2009 (UTC)[reply]

AHH BOT ERROR![edit]

diff Hey man, saw the bot got up again! Great job, but you have a space before your new references that puts them in a box. Scroll down through the page and you'll see what I mean! Peace,

Tim1357 (talk) 01:31, 23 November 2009 (UTC)[reply]

Thx for the notice. I have adjusted the code accordingly. --ThaddeusB (talk) 01:46, 23 November 2009 (UTC)[reply]

Support for other archiving services besides webcitation.org?[edit]

There is a very useful site where articles expire very quickly. Webcitation.org doesn't work with the site. However, freezepage.com, for example, does. Would it be possible to add freezepage support to the bot? (Or maybe some other archiving service that works.) Offliner (talk) 18:51, 30 November 2009 (UTC)[reply]

One has to be careful with newer archiving sites, as there are legal issues involved with copying content that they may or may not have looked into. That said, I will definitely look into the suggestion. --ThaddeusB (talk) 21:42, 30 November 2009 (UTC)[reply]
Haven't tried it yet, but I don't think FreezePage is going to be a good alternative. From their FAQ: "To save space on our system, we require that you use your account regularly, i.e. that you log in or visit any page on our site. If you are an unregistered user, you must visit our site every 30 days. If you are a member (sign up for free), we only require you to log in every 60 days. If you don’t, we may delete your account and all the frozen pages in it." - Kollision (talk) 16:10, 23 June 2010 (UTC)[reply]

Disappearing source: Editor & Publisher[edit]

Wikipedia has over 600 links to the legendary publishing periodical Editor & Publisher, which is now ceasing publication. The list is here. I suspect that the website will soon be shuttered as well. Is WebCiteBOT able to be deployed to WebCite these links before they disappear? Please see Wikipedia talk:Linkrot to help coordinate. --Blargh29 (talk) 03:32, 11 December 2009 (UTC)[reply]

I miss you[edit]

Hey WebCiteBOT

I was wondering when you would be able to start up again. I know you have been busy with Geocities and all. Good luck

Tim1357 (talk) —Preceding undated comment added 14:52, 10 January 2010 (UTC).[reply]

He left a comment on his regular user page saying that he is back to normal access after over a month of limited access. So I would assume he would begin on this pretty quickly. Fingers crossed it's up and running by the end of the month. This is an extremely useful bot and brings a lot to Wikipedia, so it's a shame it's been down for so long. I wonder if the backlog has continued to build?—NMajdantalk 22:47, 12 January 2010 (UTC)[reply]
Yes, your speculation is accurate - I will get the bot back up ASAP. It isn't #1 on my agenda, but it's near the top. The backlog has continued to grow, although at a slower than normal rate since the logging program also wasn't online all the time. I will post a status update w/in a few days. --ThaddeusB (talk) 02:46, 17 January 2010 (UTC)[reply]
Update? Your talk page says you're about caught up. Hopefully this bot will be back up and running soon.—NMajdantalk 21:09, 2 February 2010 (UTC)[reply]
I am hopeful it will be up soon as well. --ThaddeusB (talk) 20:11, 6 February 2010 (UTC)[reply]

@WebCiteBOT: Miss you big time! Great job earlier. Hope it will resume soon. Nsaa (talk) 08:54, 9 February 2010 (UTC)[reply]

News?Tim1357 (talk) 02:58, 13 April 2010 (UTC)[reply]
Last update.—NMajdantalk 13:22, 13 April 2010 (UTC)[reply]

URGENT: NY Times and WebCiteBOT[edit]

New York Magazine is reporting that the New York Times is going to cease providing free content and will install a "metered" payment system. Is it possible for WebCiteBOT to archive NY Times articles before this happens?--Blargh29 (talk) 22:39, 18 January 2010 (UTC)[reply]

Perhaps this is the last gasp of the NY Times. This should definitely be a priority. Huntster (t @ c) 00:19, 19 January 2010 (UTC)[reply]
It has been announced that the NY Times will begin the pay model in 2011, so we have all of this year to archive NYT articles.—NMajdantalk 14:42, 20 January 2010 (UTC)[reply]
Very good, thanks for the update NMajdan. Sometimes companies like to jump into such things quickly...glad this is not the case. Huntster (t @ c) 21:02, 20 January 2010 (UTC)[reply]
Let's not panic. IIRC, the NYT plan is to allow IPs a few articles per month. So one could still access them, just not in bulk. --Gwern (contribs) 19:20 2 February 2010 (GMT)

Actually, I agree with Blargh29 that we should do the archiving as soon as possible. Looking at this VPM thread, I see that The Times (of London) has added NOARCHIVE to its pages in anticipation of its move behind a paywall, in which case webcitation.org will not grab the content (a rough way to detect such tags is sketched below). I think there's a reasonable risk that the NYTimes will do the same thing. We should archive these pages while we still can. user:Agradman editing for the moment as 160.39.221.164 (talk) 06:33, 4 May 2010 (UTC)[reply]
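(Side note for anyone scripting around this: NOARCHIVE is signalled with a robots meta tag, so a harvesting bot could screen pages before queueing them for WebCite. The check below is only a rough sketch, not anything WebCiteBOT actually does; the function name is made up and the tag matching is a heuristic, not a real HTML parse.)

 import re
 import urllib.request
 
 def has_noarchive(url):
     """Rough check for a NOARCHIVE robots meta tag, which archiving
     services such as WebCite honour by refusing to keep a copy."""
     html = urllib.request.urlopen(url).read(65536).decode("utf-8", "ignore")
     pattern = r'<meta[^>]*name=["\']?robots["\']?[^>]*noarchive'
     return re.search(pattern, html, re.IGNORECASE) is not None

A bot could skip (or flag for manual handling) any page where this returns True.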

Priority : Archiving of BBC News articles[edit]

The BBC has announced that several sections of its old websites will be axed and their old content pruned, owing to a funding shakeup at BBC Online. I'm concerned that this is likely to include old versions of BBC News articles dating back to 1999, which an awful lot of articles depend heavily upon for reliable sourcing (for some, in fact, they are the only source). I think we should start converting them into WebCites before they are removed; otherwise we'll have a huge sourcing problem on our hands. - Mailer Diablo 16:25, 2 March 2010 (UTC)[reply]

Times / Sunday Times[edit]

More news: The Times / Sunday Times will charge from June. Rd232 talk 07:33, 26 March 2010 (UTC)[reply]

WebCiteBOT is not operating[edit]

It seems the bot has made no edits since November 2009. Any hints why? User:LeadSongDog come howl 15:36, 13 April 2010 (UTC)[reply]

Look a couple threads up.—NMajdantalk 15:52, 13 April 2010 (UTC)[reply]
I saw that, but it doesn't give any hints as to why the bot is down; it just says that it is down. There are ways other users might be able to help, with bug reporting, analysis, code inspection/review, test cases, etc., but right now we're in the dark as to the problem. User:LeadSongDog come howl 16:44, 13 April 2010 (UTC)[reply]
I hope to have it up again tonight or tomorrow at the latest. --ThaddeusB (talk) 23:14, 25 April 2010 (UTC)[reply]
Great to hear, Thaddeus. Huntster (t @ c) 01:02, 26 April 2010 (UTC)[reply]
That is great, but should a bot request be made for a second bot that does webcite citations? - Peregrine Fisher (talk) 02:21, 26 April 2010 (UTC)[reply]
What do you mean, webcite citations? This bot takes newly added references and archives them using WebCitation. I just want to clarify your question before Thaddeus responds.—NMajdantalk 12:12, 26 April 2010 (UTC)[reply]
I mean the same thing that this bot does. I was the one who made the original bot request, so I know how it works. It's really cool, when it's working. - Peregrine Fisher (talk) 14:35, 26 April 2010 (UTC)[reply]
Ha. Ok. You're wanting a duplicate bot. And I would agree. A duplicate would be nice. Once this one gets up and running and stabilizes, I'd like to see the ability to do certain articles on demand. Sorry for the confusion.—NMajdantalk 15:03, 26 April 2010 (UTC)[reply]

It says it has done about a million links this year, but looking at its contributions, it seems to be around 500. Any idea what's going on, or am I reading it wrong? - Peregrine Fisher (talk) 03:48, 29 April 2010 (UTC)[reply]

That's how many website URLs it has collected from articles, I believe. It then sorts through them and attempts to archive them. Huntster (t @ c) 05:46, 29 April 2010 (UTC)[reply]
I was just about to ask that as well, and Huntster's explanation matches my conclusion. It seems that this whole time WebCiteBOT has been down, it has still been collecting new references in its database, which it will archive when it comes back online (a toy sketch of that collect-then-archive split is below). He definitely has his work cut out for him!—NMajdantalk 15:11, 29 April 2010 (UTC)[reply]
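(The real bot's code is unpublished, so every name in this toy sketch is invented:)

 # Toy sketch: phase 1 logs newly added links into a backlog; phase 2
 # works through the backlog whenever the archiver actually runs. In
 # the real bot the backlog is a persistent database, not a list.
 backlog = []
 
 def collect(new_links):
     """Phase 1: record URLs as they are added to articles."""
     backlog.extend(new_links)
 
 def archive_pending(archive_one):
     """Phase 2: attempt to archive everything collected so far."""
     while backlog:
         archive_one(backlog.pop(0))
 
 collect(["http://example.com/story"])
 archive_pending(lambda url: print("archiving", url))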
Looking at the last entries in WebCiteBOT's contributions and log, those archives are available through webcitation.org's query page. It looks like a simple case of the bot no longer submitting requests since November 2009. If the bot's maintainer can't spare the time to get it running, there may be no alternative to creating another bot. I must say, though, that webcitation.org's parent organization, University Health Network, has recently been devoid of any mention of WebCite. Perhaps someone should contact Gunther Eysenbach to find out what's going on? LeadSongDog come howl 16:47, 29 April 2010 (UTC)[reply]
Why don't we get a second bot going? Even if this one worked perfectly, we should probably have a backup. It looks like there are several people aware of the situation right now, so we could work on it together (it's mostly just a bot request). I can make the request, but I don't spend much time on-wiki anymore. So, if someone who's more active would do it, that would be best. If no one else wants to do it, I'll try to get it done in the next week or so, but as I said, I'm not that active. We should probably drop Thaddeus a line on his talk page as well. I think he doesn't give out his code (or freely license it), so the new bot may have to be created from scratch. But if we ask nicely, he might help someone else get up and running really quickly. - Peregrine Fisher (talk) 03:20, 30 April 2010 (UTC)[reply]
We probably do need a second WebCiteBOT, primarily because there are so many sources on Wikipedia that one bot can't be expected to handle them all. And lately, with GeoCities, Encarta and now the Times, there is constantly some source going offline that requires priority handling, which means the general references don't get archived. Also, User:WebCiteBOT has been offline since November, and the bot's operator has posted numerous times that the bot would be running again soon, but nothing has ever happened (here, here and here). The bot operator was busy in the real world for the last part of 2009 and the first part of 2010, but it appears he has resumed normal Wikipedia activity, so it is obvious this bot is just very low on his priority list. Because of the good this bot can do, I would really like to see a second bot, even if the current bot resumes normal activity. The current bot has a database of references that have been added since it went live, but (I don't believe) it will archive references that existed before it went live, so a new bot could handle those. Now, I don't know how to create bots, but I will support your request if you do make one.—NMajdantalk 13:21, 4 May 2010 (UTC)[reply]

(redent) All you have to do is make a friendly request at Wikipedia:Bot requests in plain English. Other people do the actual coding. - Peregrine Fisher (talk) 18:16, 4 May 2010 (UTC)[reply]

I am aware of that. I've made a similar request there before so I think it best if someone else handles the request.—NMajdantalk 18:29, 4 May 2010 (UTC)[reply]
Looking again at my previous request, it seems WebCiteBOT's owner said that WebCitation limits bots to one query every five seconds. Now, I do not know whether it limits each bot to one query every 5 seconds or all bots combined to one query every 5 seconds. If the former, then no big deal. If the latter, then the two bot owners would have to tailor their bots so as not to violate it; for instance, each bot would probably have to limit itself to one query every 10 seconds (a rough sketch of that coordination is below). Regardless, we really do need at least one bot up and running.—NMajdantalk 14:13, 5 May 2010 (UTC)[reply]
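(A minimal sketch of the 10-second idea, assuming the limit applies to all bots combined. The interval value and the archiving call are placeholders, not WebCitation's confirmed policy:)

 import time
 
 class RateLimiter:
     """Enforce a minimum delay between outgoing requests. If
     WebCitation allows one query every 5 seconds in total, two bots
     each using min_interval=10 stay within that combined rate."""
 
     def __init__(self, min_interval):
         self.min_interval = min_interval
         self.last_request = 0.0
 
     def wait(self):
         # Sleep just long enough to honour the interval.
         elapsed = time.time() - self.last_request
         if elapsed < self.min_interval:
             time.sleep(self.min_interval - elapsed)
         self.last_request = time.time()
 
 limiter = RateLimiter(min_interval=10)
 for url in ("http://example.com/a", "http://example.com/b"):
     limiter.wait()
     # submit_to_webcite(url)  # hypothetical archive call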
As a simple suggestion, building the database can be done by one or both bots, independently of sending archive requests from it to WebCitation. I'd suggest both bots should have an active mode and a watchdog mode. In watchdog mode, perhaps send just one request an hour, to be sure that a) both bots are working and b) WebCitation is working. In active mode, burn through the database as fast as WebCitation.org is willing to let us go.
It's also worth asking WebCitation.org whether it would make a difference to their server load if we sent the requests grouped (or ungrouped) by host. One obstinate host might tie them up for a long time in the grouped case, or they may already have code to manage that (a sketch of the grouping is below). LeadSongDog come howl 22:13, 5 May 2010 (UTC)[reply]
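(Grouping by host is cheap to do on our side; a rough sketch with a made-up queue:)

 from itertools import groupby
 from urllib.parse import urlparse
 
 def grouped_by_host(urls):
     """Yield the queued URLs batched by host, so an archiving run
     works through one site at a time instead of interleaving hosts."""
     host = lambda u: urlparse(u).netloc
     for name, batch in groupby(sorted(urls, key=host), key=host):
         yield name, list(batch)
 
 queue = [
     "http://www.nytimes.com/a.html",
     "http://news.bbc.co.uk/b.stm",
     "http://www.nytimes.com/c.html",
 ]
 for name, batch in grouped_by_host(queue):
     print(name, len(batch))  # news.bbc.co.uk 1, www.nytimes.com 2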
It could also help if ThaddeusB ran the bot soon. Thaddeus, do you think you could graduate the bot to a cron job so we don't have to bug you to run it? Tim1357 talk 21:11, 13 May 2010 (UTC)[reply]

What has Thaddeus said lately? I missed it if he's commented somewhere. - Peregrine Fisher (talk) 04:41, 14 May 2010 (UTC)[reply]

He's still active on other articles. LeadSongDog come howl! 05:51, 14 May 2010 (UTC)[reply]

UPDATE 2010 Aug 25 - Related conversation at Wikipedia:Bot requests/Archive 37#WebCiteBOT still down, replacement growing more urgent - Hydroxonium (talk | contribs) 01:08, 25 August 2010 (UTC)[reply]

Ah, this is terribly discouraging. I guess we'd better get moving on another bot.LeadSongDog come howl! 20:24, 1 October 2010 (UTC)[reply]

UPDATE 2011 Feb 23 - ThaddeusB posted ":Note: I have just returned to Wikipedia and hope to have the original WebCiteBOT back up and running within the next few days unless people object to me doing so here --ThaddeusB (talk) 23:33, 22 February 2011 (UTC)".[1]   — Jeff G.  ツ 22:33, 23 February 2011 (UTC)[reply]

UPDATE 2011 May 24 - Unfortunately, Special:Contributions/ThaddeusB tells us that good intentions came to naught. We have to get on with an alternative.LeadSongDog come howl! 16:53, 24 May 2011 (UTC)[reply]

http://webcitation.org/ at 10:24, 12 May 2010 (UTC): "WebCite is currently under maintenance We will be back up soon. "[edit]

Another reason why we should build our own. I am having nightmares that one day they are going to break the entire (scholarly) internet.

Anyhow back to exams... AGradman / talk. See User:WebCiteBOT at 10:24, Wednesday 12 May 2010 (UTC)

Little Thetford and WebCite[edit]

Running Little Thetford through the WebCite comb produces 673 possible URLs, each with a tick box. Many of them are duplicates, and many others are wikilinks or Wikipedia maintenance pages. Manually trawling through each identified URL will take a while. A little research revealed a potential solution, WebCiteBot, which, if I understand correctly, will do the job for me. In particular, I believe it will archive the references that contain URLs and then edit each such reference to include archiveurl and archivedate parameters (see the sketch below)! Magic! Is the bot still working? Is there a version that can be targeted at one page? How much money does the author of WebCiteBot want? --Senra (talk) 13:14, 2 August 2010 (UTC)[reply]
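(For anyone curious what that edit looks like mechanically, it is a parameter insertion along these lines. This is only a rough sketch with a dummy archive ID; the bot's actual matching logic is more involved and unpublished:)

 import re
 
 def add_archive(ref_text, archive_url, archive_date):
     """Append |archiveurl= and |archivedate= to a citation template
     that does not already carry them (rough sketch only)."""
     if "archiveurl" in ref_text:
         return ref_text  # already archived; leave untouched
     insertion = "|archiveurl=%s|archivedate=%s}}" % (archive_url, archive_date)
     return re.sub(r"\}\}\s*$", insertion, ref_text)
 
 before = "{{cite web|url=http://example.com/page|title=Example}}"
 print(add_archive(before, "http://www.webcitation.org/5AbCdEfGh", "2010-08-02"))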

WebCiteBOT can make Wikipedia money![edit]

I posted this project on the Bounty Board, as explained in this post at the village pump. Good luck and best wishes. AGradman / talk / how the subject page looked when I made this edit 18:13, 11 November 2010 (UTC)[reply]

WebCiteBOT edit 'Belgium' article on 31 October 2009[edit]

WebCiteBOT edit [2] inserted |archiveurl=http://www.webcitation.org/5kwPxLurr|archivedate=2009-10-31|deadurl=yes twice. The first insertion points to the Encarta Encyclopedia, as expected; the identical second one does too, which is definitely NOT expected: it should show (the content of) a .pdf from an entirely different source that is (at least today) a dead URL. Can this still be corrected so that the proper web archive is retrieved?

Please find the cause of this apparent malfunction, and try to find out where else it may have occurred so as to correct it there as well.
▲ SomeHuman 2011-01-28 17:04 (UTC)