Wikipedia:Reference desk/Archives/Computing/2015 January 30



January 30

Urdu keyboard for Nokia Lumia

Is it possible to get an Urdu keyboard for a Nokia Lumia somehow? The phone's list of keyboards doesn't include one. 88.192.251.144 (talk) 04:44, 30 January 2015 (UTC)

Try http://store.ovi.com/content/272711 or http://forum.xda-developers.com/showthread.php?t=1587153 Justin15w (talk) 16:13, 30 January 2015 (UTC)

Alt code no keypad

Unable to get Alt codes to work on a Win7 machine with no numeric keypad. The possible solution suggested here does not work; it just beeps for each number keypress. Any way around this? ―Mandruss  07:44, 30 January 2015 (UTC)

I don't think so. I tend to keep my Character Map open when I need extended characters; it works a bit like the "Special characters" feature in the Wikipedia edit window.--Shantavira|feed me 08:40, 30 January 2015 (UTC)
There are compose key applications for MS-Windows. If you enter Latin digraphs frequently, you may find one more useful than looking up the correct character code. LongHairedFop (talk) 11:49, 30 January 2015 (UTC)
It's also possible to buy a USB numeric keypad (or use a full USB keyboard) to make things easier. StuRat (talk) 15:08, 30 January 2015 (UTC)

Reverse engineering

Special:Random led me to the Crackme page; I was surprised to see that it's difficult for knowledgeable people to reverse-engineer a program. Since software necessarily results in the zeroes and ones going to the CPU, why is it difficult for the human user to be able to get the code in some fashion, whether machine code or something more readable? Couldn't you just open an .exe file in Notepad, or something like that, and get the code? Just curious: I'm not planning on attempting this kind of thing. Nyttend (talk) 13:26, 30 January 2015 (UTC)

It depends on what you think of as the "code" and what you mean by reverse engineer (e.g. what are the goals?). It also depends on what you mean by "difficult" - lots of people do this, it just isn't something that you can pick up and do in an afternoon. Some techniques are described at Reverse_engineering#Binary_software_techniques.
The general reason that it's "hard" is that compilers are sort of like a one-way function (n.b. pedants: this is just a vague analogy). The point is, there is no simple way to decompile an exe and get source code back. There are things called decompilers, but I don't think they work with much generality. This is part of why people started GNU and similar open-source software (products/development groups, licenses, etc.). Sure, you can open up an exe file and look at it, but what will you do? If you change even a single character, the whole thing might break. So you can't usually alter things at that level. If you want to share it with a friend, you might run into built-in copy protection. There's a lot more to it, but I think the key is that compilation is a many-to-one operation. SemanticMantis (talk) 15:00, 30 January 2015 (UTC)
I use an excellent free-software debugger, lldb. It can disassemble almost all executable file formats, including most variants of what you can call "an .exe file," and present me with, e.g., the x86 machine code in human-readable mnemonics. However, this does not mean that the code is easy to understand or modify. I think Nyttend's conceptual stumbling block is that he misapprehends the absolutely vast conceptual chasm between having code - in any form - and understanding this code in a way that is suitable to modify its control flow. This task is challenging even in high-level languages. In low-level representations like machine-code mnemonics, the task is more challenging because it requires very thorough knowledge of overwhelmingly-complex modern CPU architectures. Nimur (talk) 16:17, 30 January 2015 (UTC)
1) It wouldn't be readable in a regular text editor, as many of the characters are nonprinting characters, like ESC. There are special editors, like BEAV (Binary Editor And Viewer), which display those characters as hex codes or something else readable.
2) Making sense of those hex codes is also quite difficult, as the program is written in machine language, and you'd need to know exactly how that computer interprets it to figure out what it means. Even then, that would be quite a difficult task. For example, all the comments and variable names are stripped off, so exactly why the program does what it does at each step may remain a mystery. StuRat (talk) 15:03, 30 January 2015 (UTC)
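A minimal sketch of what such a binary viewer shows, written in Python. The function name hex_dump and the sample bytes are made up for illustration; this is not how BEAV itself works, only the general idea of printing each byte as a hex code next to its printable form.

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Render bytes as: offset, hex codes, printable ASCII (dots for the rest)."""
    lines = []
    pad = width * 3 - 1  # room for `width` two-digit codes separated by spaces
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Only bytes 32..126 are printable ASCII; show everything else as "."
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part.ljust(pad)}  {text_part}")
    return "\n".join(lines)


# Made-up sample: "MZ", an ESC character (0x1b), a NUL byte, then some text.
sample = b"MZ\x1b\x00Hello"
print(hex_dump(sample))
```

The ESC and NUL bytes, which Notepad cannot display sensibly, appear as 1b and 00 in the hex column and as dots in the text column.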

(edit conflict)

You are talking about a decompiler. It is trivial to get the assembly out of a program (disassembly), unless it is self-modifying. However, for all but the simplest programs, understanding what it is doing is not easy. There are no variable or function names to guide you, nor is the layout of the data structures available. An optimising compiler will interleave the CPU instructions for each source statement in the original code. You can use information from the operating-system calls to guide you, as they are in a fixed, known format; that knowledge can be used to understand the meaning of a variable, and the variable can then be tracked as it passes into other functions. However, IIRC, decompilers need a lot of guidance from their users. LongHairedFop (talk) 15:03, 30 January 2015 (UTC)
Here is a relevant TED Talk by Chris Domas, about reverse engineering firmware and the like. http://www.ted.com/talks/chris_domas_the_1s_and_0s_behind_cyber_warfare Fractal618 (talk) 15:59, 30 January 2015 (UTC)

To add to Nimur's (excellent) response, most compilers optimize the source in ways that, while more efficient, are very difficult to trace back to the original intent of the source. Some code is also intentionally designed to be difficult to reverse engineer (either through obfuscation, or by inserting code that monitors the debugging interrupts and alters the code's execution). Then there is the simple fact that after you run the executable through a disassembler, you have to distinguish the substantive program code from the file header, boilerplate code (i.e. code that the compiler adds to each program to set up a stack and transfer control to the main procedure), and data. ALL of this looks the same: in particular, there is no way to tell what part of the disassembled file is code and what is data by examining the file itself. You must rely on what you know about the language convention, compiler, operating system, and processor. And if the program interleaves code and data--a fairly common practice for lookup tables or jump tables--then some of this data will have values that are valid opcodes. Is the string "movl ax, 1" an opcode that will be executed, or an interleaved data element?

Can you work around these difficulties? Sure, in some cases: visit http://www.nesdev.com or http://www.smspower.org to see how people have reverse-engineered NES and Sega video games (if you REALLY want your mind blown, visit http://adamsblog.aperturelabs.com/2013/01/fun-with-masked-roms.html, where they describe how to retrieve the machine code from a mask ROM with powerful solvents and image recognition software. My hat is off to these guys!). Finally, for a great discussion of the issues I've brought up, complete with examples, visit https://en.wikibooks.org/wiki/X86_Disassembly, which provides information on reverse engineering 80x86 machine code. OldTimeNESter (talk) 18:47, 30 January 2015 (UTC)

Response to everyone: what's the point of something like Crackme? I figured the point was to enable a professional or highly-experienced amateur to be able to get the machine code from the .exe. I'm not asking how I can do it; I'm asking about it being done by a guy who's already highly familiar with computer science, and perhaps highly familiar with the type of program that's being reverse engineered. This is something I've never understood about open-source software, as well: you already have the code for non-open source software (otherwise your computer couldn't understand what to do), so what's the difference? I understand the copyright issues, of course; I'm talking about the technical ability to do stuff, regardless of whether it's legal. And isn't there a one-to-one mapping of machine code to a more readable form? If not, how could it be opened in Notepad (i.e. how would Notepad know that this was an ESC character and that was something else?), and how would the "Assembly languages" section of machine code be able to work? Nyttend (talk) 21:24, 30 January 2015 (UTC)
Your questions, answered in sequence:
  • "what's the point of something like Crackme?" ... "I figured the point was to enable a professional or highly-experienced amateur to be able to get the machine code from the .exe."
    • You've mixed up a "Crackme" with the tool you would use to solve it. And no, that is not the point of a crackme. A "crackme" is a software toy, a brain-teaser: it's sort of like a sudoku puzzle. It's any particular instance of a software program that is presented as a challenging problem. A programmer might find a solution to any particular "crackme" problem using other tools - like a debugger or disassembler. A person who is good at "solving" such "crackme" brain-teasers might also be good at reverse-engineering other types of software problems - there is an overlapping skill-set. A programmer might "hone their skills" by practicing on these toy programs, and later use those skills to solve real problems, potentially with illicit goals. It is my opinion that this is an inefficient use of a programmer's time: the very same person could simply go read the compiler- or debugger- manual instead of pretending that their "hackery" is equivalent to an act of magic.
Now, to your other questions:
  • "isn't there a one-to-one mapping of machine code to a more readable form...?"
    • No, this is not strictly true. The first issue is, "what constitutes a more readable form?" There are thousands of ways to answer that. A disassembler can substitute numeric codes with text-mnemonics. For most computer architectures, this mapping is one-to-one - with a few special caveats. However - this format, which is still called "machine code" - is not very suitable for a skilled programmer who wants to read, modify, and generally understand the program. Even highly-skilled programmers can only read and decipher very tiny, trivial, minuscule bits of machine-code. It doesn't matter whether your "hacker" is an autistic autodidact who learned how to program on a home-built computer, soldered out of CPUs constructed from bits of discarded aluminum cans on the street, or if she is a highly-paid, highly trained, world-caliber programmer with university pedigrees. A machine-code listing of a program longer than, say, 150 mega-words cannot be read and understood by a human. A human cannot read one hundred fifty million machine-instructions and make meaningful sense of them. This is why we use digital computers: these machines do menial jobs, like instruction fetch and decode - faster than any human can. Even if you find a speed-reader who can read ordinary English prose at a hundred pages per minute, I can find program code listings that would take hundreds of earth-years for that human to read. A decompiler - which is a totally different tool than a disassembler - attempts to take these listings and back-project them into a higher-level language, where the representation can be orders of magnitude more compact. Ergo, it attempts to summarize the long program-listing with a shorter, but exactly equivalent representation. Add extra emphasis on the word "attempt." Decompilation is notorious for producing source-code that is less readable than the machine code. This is an immature field; the state of the art is not very good.
  • "...about open-source software, as well: you already have the code for non-open source software (otherwise your computer couldn't understand what to do), so what's the difference...?" (emphasis added)
    • This is a sticky point. What is "source code"? In fact, the Free Software Foundation does not provide a definition for source code based on what it is. Instead, "The Free Software Definition" tells you what source-code is based on what you can do with it. If you can freely run, study, modify, and redistribute it, it is source code. So, if I write a perl-program, the text of my program listing is the source-code. If I use Glade to generate a user-interface for my GTK+ program, my Glade program outputs an XML file and that file is my source-code. If I use Xcode Interface Builder, I might choose to define my plist as my source code - even if that plist is committed to disk as a binary file. If I write a perl program to create the text of a different C program which I later compile - the source code might be either the text in perl, or the intermediate text in C. And if I write a perl script to produce a VHDL description of a CPU and then use that CPU to design a compiler to compile my C program and then run that C program to launch Glade and design a user interface in GTK+.... nobody really knows where my "source code" ends, and where my "program" starts. The source code, according to FSF, is whatever stuff allows another user to read, study, modify, and redistribute my program. Lawyers who specialize in licensing can quibble for years about this. Ironically, despite FSF's efforts to make a clearly-written plain-english license, the GPL is perfectly worded for attorneys' job security! Other organizations use different definitions. For example, IEEE "so construes" source code to also include binary executables. The aptly named IEEE-SCAM conference is the world's preeminent place to argue about this. If one wanted, one could make the case that they distribute "source" (in a license-compliant way) simply by distributing the binary executable of that software. I do not, however, think they would win many court-cases. 
Probably most amazingly: according to this very same amorphous definition, if I gave you the source listings for the script that generated the compiler that designed the CPU that ( ... ) ... the listings would be so massive that you could not freely run, study, modify, and redistribute - and they would cease to be source code according to the FSF. There is a word for this type of source-code - it's source, but no human can use it!
Nimur (talk) 22:09, 30 January 2015 (UTC)
I think you are slightly confused about the FSF. If you can freely run, study, modify, and redistribute software, it is free. The FSF does define source code in the GPL V3: "The “source code” for a work means the preferred form of the work for making modifications to it.". --Stephan Schulz (talk) 15:38, 2 February 2015 (UTC)
Yes, but what if I tell you the "preferred form" is the binary executable itself? Who gets to decide what form is preferable? This conundrum becomes even more pathological as we introduce more and different types of software development abstractions.
In the United States, this exact problem has been debated in many many high-profile and many more low-profile copyright cases. It has been the opinion of the United States that GPL-esque licenses and definitions are "overly broad" to the extent that they are unenforceable (later appealed, vacated and remanded); it has been the opinion of the United States that “source code” is the program as initially written in the programming language being used; but this is not actually codified in statute law. 17 U.S.C. § 101 Definitions defines a computer program, and defines retransmission of that program, but does not define "source code." That is a definition that is historically left to individual judges on a case-by-case basis. The Compendium of U.S. Copyright Office Practices does define source code, and has been used as a reference in court cases where the exact definition of source-code has been questioned. That definition says that source code is generally "BASIC, COBOL, or FORTRAN"... and is generally "changed... by a separate program within the computer called an assembler or a compiler to enable the program to be run on a particular brand and model computer (e.g., a compiler on a TRS-80 Model III...)", so take the definition with a grain of salt!
So, if a commercial programmer redistributes somebody else's free-licensed executable code in binary form (e.g., for the purpose of controlling a model train), that binary code may be license-compliant distribution of source code; or the license may be unenforceable; or the tables may turn the next time it comes to a court. If the exact same programmer redistributes somebody else's thread scheduler, it might be a license violation. There's a lot of gray area.
Nimur (talk) 03:01, 3 February 2015 (UTC)
See how basic my (mis)understanding is: I thought Crackme was a family of related programs ("I bought Crackme from Amazon" or "I support Crackme's programs in their corporate goals"), not realising that crackmes were a genre of programs. So "source code" isn't just the machine code: I missed out on that. Imagine that I buy a CD with a computer program on it: I know that the CD just stores 0 and 1, so I figured that it stored the machine code for the program, and that when the software manufacturer's programmers open the file, they'd have a tool that made the machine code readable by converting it into the text one or more of the programmers typed while using C++ or Visual Basic or whatever they're using. So...if you start with machine code, it's not possible to do an exact conversion into the instructions that were typed using the programming language? If that's the case, now I think I slightly understand the technical difference between open and non-open source software: is it that open, unlike non-open, comes with the instructions typed using the programming language? Nyttend (talk) 22:28, 30 January 2015 (UTC)
The CD does store the machine instructions (1s and 0s). Each instruction is a binary string that corresponds to an operation that the processor can perform; this is called an opcode. There is one opcode for every instruction the processor can execute; if there is no opcode for a task, the processor cannot do it, because it is not designed to do it, in the same way that turtles cannot fly, because they do not have wings (i.e. the required functionality just isn't there).
A sample opcode might be 00110001. This opcode might move a literal value (like "1") to the register AX (registers are the processor's temporary storage areas). A program might do this so that later it can add the value in AX (here, "1") to another value.
It's hard to remember that 00110001 refers to the opcode that moves a literal value to the register AX. So, we assign a mnemonic, like the string "MOV AX, 1". This way, we can write the opcode as "MOV AX, 1" rather than "00110001". However, the computer only understands "00110001", so we use a program called an assembler that replaces "MOV AX, 1" with "00110001". This is called assembling the program, and it is the literal substitution of one text string for another. There is a 1 to 1 correspondence between the mnemonics and the binary opcodes, and the assembler does the same task as a simple cipher that replaces A with 1, B with 2, and so on.
Since there is a 1 to 1 correspondence, you can also go the other direction. The mnemonics and binary opcode relationships are publicly available for any processor you will ever encounter. Going from the binary opcode to the mnemonic is called disassembly. You can download a disassembler (also publicly available) and run the program on your CD through it; then you will have a list of mnemonics like "MOV AX, 1" that tell you explicitly what the program does. Reverse-engineering the program requires you to read through this list of mnemonics and figure out what they do.
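To make the "simple cipher" idea concrete, here is a toy sketch in Python. The three-entry opcode table is invented for illustration and these are not real x86 encodings (real instructions also carry operand bytes and vary in length); the names OPCODES, assemble, and disassemble are likewise hypothetical.

```python
# Hypothetical one-byte instruction set; these are NOT real x86 encodings.
OPCODES = {
    "MOV AX, 1": 0b00110001,
    "ADD AX, BX": 0b00000011,
    "RET": 0b11000011,
}
# Because the table is one-to-one, it can simply be inverted.
MNEMONICS = {code: text for text, code in OPCODES.items()}


def assemble(lines):
    """Replace each mnemonic with its opcode byte."""
    return bytes(OPCODES[line] for line in lines)


def disassemble(machine_code):
    """Replace each opcode byte with its mnemonic."""
    return [MNEMONICS[byte] for byte in machine_code]


program = ["MOV AX, 1", "ADD AX, BX", "RET"]
binary = assemble(program)
print([f"{b:08b}" for b in binary])  # the 1s and 0s stored on the CD
print(disassemble(binary))           # the same listing, recovered
```

The round trip recovers the mnemonics exactly, which is the easy part; nothing in it recovers why the program moves 1 into AX.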
This is very difficult to do, because programs of any size and complexity are not written using mnemonics; instead, they are written in what is termed a "higher-level" language, such as C++, Python, or Visual Basic. This is because the opcodes do very specific, very specialized tasks, and it takes dozens (if not hundreds) of them to do something as simple as print the phrase "Hello,world" to the screen. In contrast, you can type the statement PRINT "HELLO,WORLD" in Visual Basic, and it will do this. This is what the programmers actually do, and the list of statements is called the source code. Of course, the processor can't understand PRINT "HELLO,WORLD" either, so this statement must be translated to the dozen or so opcodes that actually do the work. This is called compilation. Writing a program to do this type of translation is very, very difficult, but it is certainly possible, and in fact there are compilers for every language that programs you are likely to encounter as a consumer are written in.
Now, for the fundamental problem: there is no generalizable way to translate the list of mnemonics back to the source code; that is, to decompile. This is because there is NOT a 1 to 1 relationship between the statement PRINT "HELLO,WORLD" and the opcodes that do the work. The reason why is technical, but you can get a sense of it by thinking of the function y = x^2; you can translate any x value into a y value, but given a y value, you can't determine the unique x value that created it (for example, if y = 4, x could be 2 or -2). For the decompilation problem, it's much worse: there can be a very large number of possible source code statements generated by each discrete set of opcodes.
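The many-to-one point can be seen even in an interpreted language. The sketch below uses Python's own bytecode compiler as a stand-in for a native compiler (an assumption made for illustration: CPython bytecode is not machine code, and unlike a stripped executable it keeps variable names in separate metadata). The two functions are invented examples that differ in names and comments, yet their raw instruction bytes should compare equal.

```python
def area_of_square(side_length):
    # Multiply the side by itself.
    result = side_length * side_length
    return result


def f(a):
    b = a * a
    return b


# co_code holds only the instruction bytes; local variables are referred to
# by slot number, and comments never reach the compiler output at all.
same_bytes = area_of_square.__code__.co_code == f.__code__.co_code
print(same_bytes)
print(area_of_square(3), f(3))
```

Given only those instruction bytes, there is no way to say which of the two source texts (or countless others) produced them.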
If you're reverse-engineering, you don't have access to the source code (otherwise, you could just find out what the program does by reading the statements like PRINT "HELLO,WORLD"). You have to work with the list of mnemonics, and there is no general way to translate this list back into the source code.
Figuring out what the program does from the list of mnemonics is possible, but for programs of any size and complexity, it is extremely difficult. The C++ source code to open a single window and print "Hello,World" in MS Windows would make a German philosopher blush in its length and complexity (you can view it here https://msdn.microsoft.com/en-us/library/bb384843.aspx). This is for the most basic program that anyone would write, one with no real functionality! This source code compiles to thousands and thousands of opcodes, and these are all you have access to. The probability of reverse engineering MS Word from such a list is vanishingly small; even Freecell is daunting.
It gets worse: the program on your CD (the executable file) doesn't just contain opcodes, it also contains data, and a file header. The disassembler cannot distinguish between the opcodes, the data, and the file header: it just translates 00110001 to MOV AX, 1, even though 00110001 might be an ASCII character to be displayed, or the part of the header that tells MS Windows how big the file is.
It gets worse II: the source code will have variables, functions, and control structures with descriptive names. You might have StringToPrint, printf(), and a control structure such as if...then. The disassembled opcodes have none of this: variables are either raw addresses, or elements on the program stack. Functions (generally called subroutines in assembly) are called by their address; that is, you don't call printf(), you jump to address 00111001101110011111100100111101. You don't see "if x = 1 then printf()", you see a comparison instruction followed by an address.
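A small sketch of that ambiguity, in Python. The byte is the sample opcode 00110001 from earlier in this post; the one-entry opcode table is made up, as before, and is not a real x86 encoding.

```python
byte = 0b00110001  # the sample opcode from the discussion (hex 31)

# Reading 1: an instruction, via a made-up one-entry opcode table.
HYPOTHETICAL_OPCODES = {0b00110001: "MOV AX, 1"}
as_instruction = HYPOTHETICAL_OPCODES[byte]

# Reading 2: text. Hex 31 happens to be the ASCII code for the digit "1".
as_text = bytes([byte]).decode("ascii")

# Reading 3: a plain number, such as a size field in a file header.
as_number = byte

print(as_instruction, repr(as_text), as_number)
```

All three readings are valid for the same eight bits; only knowledge from outside the byte itself (file format, compiler conventions, where execution actually flows) says which one applies.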
It gets worse III: modern compilers optimize their output code in ways that, while efficient, are often extremely non-intuitive. Reverse engineering a program in the 1960s might be compared to figuring out how your auto's manual transmission works; today, it is figuring out the computer-controlled automatic transmission, with no manual or specifications.
It gets worse IV: companies don't want people reverse engineering their programs, so they have their programmers obfuscate the code, or encrypt it, or insert sections that look to see if the program is being debugged (as opposed to run by the end user) and if so, change the program's behavior (this is possible because debuggers have to call system interrupts to execute the program line by line; the programmers can monitor these interrupts and react accordingly).
It gets better (?). There are workarounds for these problems: compilers have standardized ways of doing things, such as how they implement an if...then statement. The location and contents of file headers are generally known, and most programs keep their data lumped together in different dedicated sections (some interleave it with the opcodes though, and that is a BITCH to work through). The more common optimizations are known, and can be accounted for. You can even defeat encryption or debugger-trapping: the program has to be decrypted at some point (otherwise it can't run), and you can patch the code to remove the trapping, or run it in a virtual machine where you can step through line by line without explicit interrupt calls. None of this is easy or quick; in fact, the reverse engineer's mantra might be "fast, good, or cheap: pick any ONE." Then pray.
I hope this helps answer your question, and that I haven't been too wordy. As you can see, I'm very passionate about this, and I'm glad your question gave me the chance to go on about it at length. Good luck! OldTimeNESter (talk) 10:33, 31 January 2015 (UTC)

hidden code in HTML files

One of my pages has several inches of white space that I didn't put in. There is nothing in the code that would cause that result. When I look at the code using the Firefox tool Inspect Element - Inspector I see a slew of breaks:

<div>
    <br></br>
    <br></br>
    <br></br>
    <br></br>
    <br></br>
    <br></br>
    <br></br>
    <br></br>
    <br></br>
    <br></br>
    <br></br>
    <br></br>
    <br></br>
    <br></br>
    <br></br>
    <br></br>
    <br></br>
    <table style="width:100%; margin-left:50px">

This is what the visible code shows:

<div><table style="width:100%;  margin-left:50px">

What is happening, and how can I deal with it? --Halcatalyst (talk) 17:50, 30 January 2015 (UTC)

If you had told us which page, we could diagnose it more easily. It could be another editor adding blank lines.--Aspro (talk) 20:16, 30 January 2015 (UTC)
I'm sorry. It's near the top of http://dendurent.com/dend/PoemsLMD.html. --Halcatalyst (talk) 21:33, 30 January 2015 (UTC)
It's all the <br> tags found at the end of each table data cell. Get rid of them: they stack up at the top for who knows what reason, and they're not required, quite apart from being placed so as to be useless.
<tr>
           <td><a href="#fairy">"Fairy Drift"</a></td><br>
           <td><a href="#memoriam">"In Memoriam"</a></td><br>
</tr>

--Tagishsimon (talk) 22:00, 30 January 2015 (UTC)

In fact, they're not even valid HTML: nothing may occur in a <tr> except <th> and <td> elements. But most browsers ignore this sort of syntax error. --ColinFine (talk) 01:25, 31 January 2015 (UTC)
More importantly, the HTML specification says what browsers are supposed to do for valid HTML code. It says nothing whatever about what the browser is supposed to do when faced with incorrect HTML like this. So you may find that one browser produces something reasonable when you make mistakes like this, where another browser generates a totally screwed up page. Neither browser is behaving incorrectly, so it's important that you don't generate garbage. Even when it looks OK on your browser, you can't possibly check it on every version of every browser on every platform and every window size and screen resolution. SteveBaker (talk) 03:14, 31 January 2015 (UTC)
And mobile browsers seem more sensitive to invalid HTML. --  Gadget850 talk 11:17, 31 January 2015 (UTC)
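One way to catch this kind of mistake before a browser does is a small checker. The sketch below uses Python's standard html.parser module; the class name RowChecker and the sample markup (modelled on the rows quoted above) are assumptions for illustration, and it only tracks table elements rather than validating HTML in general. As far as I understand the current HTML parsing algorithm, stray content found inside a table is moved to just before the table ("foster parenting"), which would explain why the inspector shows the breaks stacked above it.

```python
from html.parser import HTMLParser


class RowChecker(HTMLParser):
    """Report elements that sit directly inside <tr> but are not cells."""

    ALLOWED_IN_ROW = ("td", "th", "script", "template")
    TRACKED = ("table", "tr", "td", "th")

    def __init__(self):
        super().__init__()
        self.stack = []     # open table-related elements, innermost last
        self.problems = []  # (line number, tag name) pairs

    def handle_starttag(self, tag, attrs):
        if self.stack and self.stack[-1] == "tr" and tag not in self.ALLOWED_IN_ROW:
            self.problems.append((self.getpos()[0], tag))
        if tag in self.TRACKED:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to (and including) the matching open element.
            while self.stack.pop() != tag:
                pass


sample = """<table>
<tr>
  <td><a href="#fairy">"Fairy Drift"</a></td><br>
  <td><a href="#memoriam">"In Memoriam"</a></td><br>
</tr>
</table>"""

checker = RowChecker()
checker.feed(sample)
for line, tag in checker.problems:
    print(f"line {line}: <{tag}> directly inside <tr>")
```

For real pages, a full validator such as the W3C one is the safer choice; this only shows how mechanical the check is.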

Serials and digital signature

Hi,
While using a keygen, I wondered: how come those serials are reproducible?
I mean, if every serial were simply a digital signature, one would have to break the key to build a serial generator.
The only explanation I can think of is that processing power has risen exponentially, but then all modern software would be keygen-immune.
So how come software companies don't just use digital signatures to produce serials? Exx8 (talk) 23:47, 30 January 2015 (UTC)

I can answer the first bit. Usually it's a series of equations that bring about a certain result at the end called a check digit. The sequence of numbers and the check digit need to make sense according to those equations. Credit card numbers work much the same way. That's one of the reasons why a programme can tell right off the bat if you've entered an invalid CC number. Sir William Matthew Flinders Petrie | Say Shalom! 11 Shevat 5775 17:02, 31 January 2015 (UTC)
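For the credit card case, the check in question is the Luhn algorithm. A minimal Python sketch follows; the function name luhn_valid is made up, and real serial-number schemes vary by vendor and are not necessarily this one.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn check."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit (the check digit) leftwards,
    # doubling every second digit and folding two-digit results.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


print(luhn_valid("79927398713"))          # a commonly used valid test number
print(luhn_valid("79927398710"))          # same number with a wrong check digit
print(luhn_valid("4111 1111 1111 1111"))  # a well-known test card number
```

Because the rule is public and simple, anyone who knows it can produce numbers that pass, which is essentially what a keygen does for a serial scheme built on the same idea.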
Digital signatures are fairly large. The rule of thumb seems to be that you need a 4n-bit signature to get n bits of security. Serial numbers tend to be made of case-insensitive letters and digits with ambiguous letters like O omitted, which works out to about 5 bits per character. A 25-character serial number would then only get you about 31 bits of security, which is useless, and 25 characters is already a hassle for legitimate customers. -- BenRG (talk) 18:37, 31 January 2015 (UTC)
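The arithmetic behind that estimate, as a short Python sketch. The alphabet size of 32 is an assumption (26 letters plus 10 digits, minus four ambiguous characters) chosen to match the "about 5 bits per character" figure; the 4n rule of thumb is taken from the post as given.

```python
import math

alphabet_size = 32   # assumption: 36 letters and digits minus 4 ambiguous ones
serial_length = 25   # characters in the serial number

bits_per_char = math.log2(alphabet_size)    # information carried per character
total_bits = serial_length * bits_per_char  # bits available for a signature
security_bits = total_bits / 4              # rule of thumb: 4n bits give n bits of security

print(bits_per_char, total_bits, security_bits)
```

At roughly 31 bits, an attacker would need on the order of 2 to the 31st power (about two billion) attempts, which is well within reach of ordinary hardware.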