Just a query: But how did it know the STRUCTURE of the English language, to be able to form sentences (80% of the time...seemingly above the percentage of chance), if it didn't know the English language to beGIN with?
I don't know how the program was written, so I can only guess based on how I'd try to do something like this.
First we must be precise about what it did. The program, without any specific understanding of English vocabulary or syntax, was fed the text of Emma by Jane Austen (minus spaces and punctuation). It was then able to identify 80% of the word breaks and sentence breaks. Presumably it would have been able to accomplish a similar feat on, say, a German novel if that had been handed to it.
How might it work?
Well, if I were writing something like this, I'd start by looking for short strings that get repeated a lot. Those are probably words. There is a lot of research into how to recognize that kind of pattern. It is very important for, say, compression algorithms. After the program has a list of things that it thinks are words, it would then look for patterns in the sequences of those words that are recognizably sentences. It isn't obvious how to do this, and in fact doing this non-obvious part is part of why this was research.
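To make the first step concrete, here is a minimal Python sketch of "look for short strings that get repeated a lot". This is only my guess at a starting point, not the researchers' method. The function name and the parameters min_len, max_len, and min_count are made up for illustration.

```python
from collections import Counter

def frequent_substrings(text, min_len=2, max_len=6, min_count=3):
    """Count every substring of length min_len..max_len in unspaced text
    and keep the ones seen at least min_count times (hypothetical thresholds)."""
    counts = Counter()
    for n in range(min_len, max_len + 1):
        # Slide a window of width n across the text.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return {s: c for s, c in counts.items() if c >= min_count}

if __name__ == "__main__":
    sample = "thecatsatonthematthecatranthecathid"
    candidates = frequent_substrings(sample)
    # List the most frequent candidates first.
    for s, c in sorted(candidates.items(), key=lambda kv: -kv[1]):
        print(s, c)
```

Note that this also turns up fragments like "hec" that straddle two words, which is part of why the real problem is harder than counting.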
So after a couple of passes through the text, the program has "learned" enough English to be able to identify likely word and sentence breaks. With, apparently, about 80% accuracy. Its performance has to do with the fact that there is a structure to English. It has nothing to do with why that structure is there or what it represents. After processing the book, the program likely has figured out that "word" is a word. But the program has no idea what a "word" is. It just knows that that is a string that appears a lot, and should probably be marked with spaces.
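Once the program has a list of candidate words with counts, marking the breaks could look something like the sketch below. It picks the split whose pieces are, overall, the most frequent known strings. Again, this is an assumption on my part about how one might do it, not a description of the actual program. The name segment, the max_len limit, and the penalty for unknown single characters are all hypothetical.

```python
import math

def segment(text, word_counts, max_len=10):
    """Split unspaced text into pieces, preferring frequent known strings.
    word_counts maps candidate words to how often they were seen."""
    total = sum(word_counts.values())
    # best[i] holds (score, start of last piece) for the prefix text[:i].
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in word_counts:
                score = math.log(word_counts[piece] / total)
            elif len(piece) == 1:
                # Assumed penalty: an unknown single character is very unlikely.
                score = math.log(1 / (10 * total))
            else:
                continue
            candidate = best[start][0] + score
            if candidate > best[end][0]:
                best[end] = (candidate, start)
    # Walk backwards through the stored start indices to recover the pieces.
    pieces = []
    i = len(text)
    while i > 0:
        start = best[i][1]
        pieces.append(text[start:i])
        i = start
    return pieces[::-1]

if __name__ == "__main__":
    counts = {"the": 4, "cat": 3, "sat": 1, "on": 1, "mat": 1}
    print(" ".join(segment("thecatsatonthemat", counts)))
```

Nothing in there knows what a cat or a mat is. It only knows which strings showed up often.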
Now the authors apparently want to try analyzing the human genome and seeing whether they find useful boundaries. Not being a biologist, I can think of a couple of reasons why they might find useful boundaries. First of all there are natural boundaries in DNA that somehow cause only certain stretches to get copied to RNA and eventually into creating proteins. Also something like 45% of the human genome is made up of transposons and their remnants, so there are a lot of repeated patterns to find. (Transposons, also known as jumping genes, are sections of DNA that can copy themselves to other parts of your DNA. Yes, genetics really is more complex than Mendel would have you believe...)
Will they get anywhere with this type of approach? Will the results be useful for biologists? I don't know, and neither do they. That is why what they are doing is called research.
Cheers,
Ben