Wikipedia:Reference desk/Archives/Miscellaneous/2020 February 25



February 25

Exercise reference

In terms of health benefits, the amount of recommended exercise depends upon the goal, the type of exercise, and the age of the person. Even doing a small amount of exercise is healthier than doing none. Is this reference, https://www.nhs.uk/live-well/exercise/, saying the same thing as that, or something different? Would you recommend following the reference or not? I just need help understanding what this means. — Preceding unsigned comment added by 2001:8003:7427:6B00:8869:9463:4556:B500 (talk) 12:03, 25 February 2020 (UTC)[reply]

I understand the NHS site, which I think we can regard as reputable, to be saying that there are several ways to reach your exercise target: you can do more intensive exercise over a short period, or less intense exercise over a longer period of time. The important thing is to do some exercise regularly. Short of getting breathless, the more exercise you do, the better it is for your health. Richard Avery (talk) 12:15, 25 February 2020 (UTC)[reply]
I'm sure we'd all recommend that a person consult a health professional before starting to exercise. If one has been sedentary for a long time, it's probably best to start with something low-impact, like swimming, to avoid injury. Temerarius (talk) 05:39, 29 February 2020 (UTC)[reply]

Problems using the wikipedia dump bz2 file

Hi, I just downloaded enwiki-20200201-pages-articles-multistream.xml.bz2.

I tried to open it with wikidumpparser. It crashed on the first line, saying "System.Xml.XmlException: 'Unexpected end of file has occurred. The following elements are not closed: mediawiki. Line 45, position 1.'" I extracted the xml file, but it was too big for Notepad++. I then tried firstobject xml editor, which I have used before to open xml files larger than a few gigabytes; it simply closed without any error message when I tried to open the xml.

I would prefer to open it in .net, but any language will do. I just want a program to be able to look up many articles, and I thought a local file would be better than many calls to the online wikipedia. Perhaps I was wrong. Do you have any suggestions? Is there something wrong with the dump? Is it a newer format? Do you have any suggestions on how to access it at all, or alternatively on the best way to make frequent calls to the online wikipedia?

I used a few .net libraries some years ago, but as I read the documentation, nowadays I need permission, and probably special permission if I want to make many calls. So I thought the dump might be an alternative, so as not to disturb anyone too much. Star Lord - 星爵 (talk) 20:55, 25 February 2020 (UTC)[reply]

This sounds like something for the computing desk -- I put a link there, so as not to duplicate posts. 2606:A000:1126:28D:8095:BB24:F64A:E5FC (talk) 03:21, 26 February 2020 (UTC) . . . or somewhere at Wikipedia:Village pump[reply]
I expect that combined, those bz2 files uncompressed are in the 100GB range. Maybe you can run something like "bzcat filename.bz2 | wc" (unix command, I don't know how you'd do it in windows) to see the uncompressed size of an individual file. Or can you try your parser with an earlier version of the same dump? It is possible that the one you tried is broken in some way, but having two separate ones fail is less likely.
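If you'd rather avoid the shell pipeline (or you're on Windows, as the comment above notes), the same size check can be sketched in Python with the standard-library bz2 module. This is an illustrative in-memory example, not something run against the actual dump:

```python
# Count the uncompressed size of bz2 data without writing the extraction
# to disk. Demonstrated on bytes compressed in memory; for the real dump
# you would read the .bz2 file in chunks (e.g. 1 MB at a time) instead.
import bz2

compressed = bz2.compress(b"hello wiki\n")

size = 0
decomp = bz2.BZ2Decompressor()
for i in range(0, len(compressed), 4):       # stand-in for file.read(chunk)
    size += len(decomp.decompress(compressed[i:i + 4]))

print(size)  # uncompressed byte count
```

Feeding the decompressor chunk by chunk keeps memory use flat no matter how large the file is, which is the same property that makes `bzcat | wc` workable on a multi-gigabyte dump.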

In general, for very large XML files you have to parse with a streaming, SAX-style parser that reads just one tag at a time, instead of trying to keep the whole document in memory. That means your program has to "manually" remember where it is in the file (i.e. inside some stack of nested tags). If you use Python, ElementTree is nice in that it lets you get subtrees of reasonable size and access their contents in a DOM-like fashion. 2601:648:8202:96B0:C8B1:B369:A439:9657 (talk) 08:54, 26 February 2020 (UTC)[reply]
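The ElementTree approach described above can be sketched roughly like this. The function name `iter_titles` and the sample XML are made up for illustration; the real dump wraps every tag in an XML namespace, which the `localname` helper strips off:

```python
# Streaming parse with ElementTree.iterparse: subtrees are handed over one
# at a time and cleared afterwards, so memory stays bounded.
import io
import xml.etree.ElementTree as ET

def localname(tag):
    """Strip an XML namespace prefix: '{uri}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]

def iter_titles(stream):
    """Yield each page title, clearing the finished subtree as we go."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if localname(elem.tag) == "page":
            for child in elem.iter():
                if localname(child.tag) == "title":
                    yield child.text
                    break
            elem.clear()  # drop the page subtree to free memory

# Tiny stand-in for the dump's <mediawiki><page>... structure:
SAMPLE = io.BytesIO(
    b"<mediawiki>"
    b"<page><title>Alpha</title><revision><text>First.</text></revision></page>"
    b"<page><title>Beta</title><revision><text>Second.</text></revision></page>"
    b"</mediawiki>"
)
titles = list(iter_titles(SAMPLE))
```

In real use you would pass the generator a file object opened on the decompressed XML (or wrap the .bz2 file in `bz2.open`) instead of the in-memory sample.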

By the way, I usually use expat for stuff like this, from C or C++ programs. It is perhaps considered old-fashioned, but I'm used to it. libxml2 is newer and might be preferable; it is basically similar, though, so I haven't bothered switching to it. There are Python and maybe .NET bindings for both expat and libxml2. 2601:648:8202:96B0:C8B1:B369:A439:9657 (talk) 09:04, 26 February 2020 (UTC)[reply]
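For the expat route, Python's built-in binding (`xml.parsers.expat`) shows the callback style: you register handlers and the parser invokes them per event, so nothing larger than one event is ever held. The handler names and sample document here are invented for illustration:

```python
# Event-driven parsing with expat: the parser calls our handlers as it
# encounters start tags, character data, and end tags.
import xml.parsers.expat

titles, buf, in_title = [], [], False

def start_element(name, attrs):
    global in_title
    if name == "title":
        in_title = True
        buf.clear()

def char_data(data):
    if in_title:
        buf.append(data)  # text may arrive in several chunks

def end_element(name):
    global in_title
    if name == "title":
        titles.append("".join(buf))
        in_title = False

parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start_element
parser.CharacterDataHandler = char_data
parser.EndElementHandler = end_element
parser.Parse(b"<mediawiki><page><title>Alpha</title></page></mediawiki>", True)
```

The C API works the same way, with function pointers in place of the Python callables; in either language the "stack of nested tags" bookkeeping mentioned above is yours to maintain.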
It looks to me like you should get the bz2 files that are split into smaller streams: see "2020-02-22 04:25:49 done Articles, templates, media/file descriptions, and primary meta-pages, in multiple bz2 streams, 100 pages per stream" on the dump page https://dumps.wikimedia.org/enwiki/20200220/ . That will give you a bunch of smaller files that should mostly be easier to parse, though a few of them will still be pretty large. There are some index files too. I didn't examine them to check their format. 2601:648:8202:96B0:C8B1:B369:A439:9657 (talk) 09:25, 26 February 2020 (UTC)[reply]
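The multistream layout can be exploited roughly as follows: each stream is an independent bz2 block, and the companion index file maps byte offsets to page ids and titles, so you can seek straight to one stream and decompress only it. A toy sketch with two in-memory streams standing in for the dump file (a real offset would come from the index file, whose lines look like "offset:page_id:title"):

```python
# Random access into concatenated bz2 streams, the trick behind the
# "multistream" dump format.
import bz2

stream_a = bz2.compress(b"<page><title>Alpha</title></page>")
stream_b = bz2.compress(b"<page><title>Beta</title></page>")
dump_bytes = stream_a + stream_b     # stands in for the .xml.bz2 file

offset = len(stream_a)               # the index file would supply this

decomp = bz2.BZ2Decompressor()       # fresh decompressor per stream
chunk = decomp.decompress(dump_bytes[offset:])
```

Because the decompressor stops at the end of its one stream, you pay only for the pages in that stream rather than decompressing the whole file, which is what makes "look up many articles" against a local dump practical.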
You should read Wikipedia:Database download if you haven't. --47.146.63.87 (talk) 10:17, 26 February 2020 (UTC)[reply]