Developing a language-learning program (high-frequency sentence identifier)

UPDATE: I devoted most of my nights and weekends this year to building the program described below with the help of Kostyantyn Grinchenko, an excellent Ukrainian freelance developer. Then, after realizing that we had stumbled upon a breakthrough idea that could revolutionize language learning and help many people become fluent readers of their target language, I assembled a remote team of freelance and volunteer developers, designers, native-speaker audio recorders, and translators to help me develop it into a webapp (and future mobile app). Please help WordBrewery grow by trying it out, joining our email list, subscribing to our blog, following us on Facebook and Twitter, or taking a beta tester survey. Please get in touch if you are interested in supporting the WordBrewery project as a donor, investor, advocate, or volunteer. Thank you for your support.

Over the past two months, I’ve been experimenting with outsourcing small personal digital projects to overseas freelancers on Upwork. The experience has been very positive so far. I’ve decided to devote a post to each of the tasks I outsource, to give readers a sense of how freelancers and virtual assistants can both make one’s life easier and enable one to pursue projects and ideas despite lacking the time, expertise, or patience to implement them without outside help. I find Upwork and similar platforms fascinating: a fantastic, unambiguously positive example of open markets and the division of labor.

I am very interested in finding efficient ways to learn things, especially languages. Language vocabulary is an example of the “Pareto Principle” (also known as the “80-20 rule”), according to which, “for many events, roughly 80% of the effects come from 20% of the causes.” In the context of language learning, the rule fits the fact that the most common 1,000 or so words in a language account for about 90% of everyday speech in that language. Thus, the best way to quickly develop one’s practical vocabulary in a second language is to focus on a core vocabulary list of high-frequency words.

This insight, however, must be paired with another well-established principle of literacy and language learning: it is far better to learn vocabulary by studying it in context than as isolated words in a list or on flash cards. Accordingly, language learners should study sentences, not words.

I thus came up with the idea for a language learning tool that identifies sentences with numerous high-frequency words: such sentences are especially valuable for language learners. But I could not find any program or tool that does this seemingly straightforward task. I know some basic computer programming, but for the near future I will lack the time and expertise to create such a tool from scratch myself. So I’ve decided to partially outsource it, and I’ve found several overseas programmers who appear to be very capable of creating the program as I envision it. Here is the job as posted on Upwork:

Language learning tool: High-frequency sentence identifier

Purpose: allow language learners to identify and collect sentences that are particularly valuable to study and memorize because they use several high-frequency words and do not include any obscure words.

Suggested mode of operation:

-For each language, the program would draw upon data from at least two sources:

(1) a spreadsheet (CSV) or other easily manipulable data storage mechanism into which the 10,000 or 20,000 most common words of a language are pasted. In a second column, assign each word a frequency score: the higher the frequency, the higher the score. So, for example, if the list has 500 words, the least common (most obscure) word in that list could be assigned a score of “1,” and the most common word in the list could be assigned a score of 500.

(2) a list or corpus of example sentences, whether scraped in real time from Wikipedia or other websites (e.g. Project Gutenberg, Tatoeba), or manually created and made available in a text file. For example, an Anki deck of example sentences can be exported into a text file:

-El ministro ha informado a la nación sobre la guerra. (The minister has informed the nation about the war.)
-Él tragaba sus bebidas rápidamente. (He used to gulp his drinks quickly.)

-The program would iterate over the target-language example sentences, scoring each sentence for its potential value to the language-learner. Sentences that use several high-frequency words would receive a higher score–and thus be more likely to be presented to the user. Sentences that include an uncommon or obscure word that does not appear on the frequency list at all (e.g., a word that is not among the 20,000 most common words of a language) would be penalized and less likely to be presented to the user.

-The program would have some mechanism for avoiding or reducing repetitiveness so that it brings in diverse sentences that do not repeat the same high-frequency words over and over. Ideally, this would be flexible so it does not completely disqualify a sentence just because one of its words was already used: this would disqualify too many sentences. But the program could somehow penalize sentences that use words that have already appeared in sentences the program has selected.

-The program would produce a CSV or text file of the sentences it has selected. The user should be able to instruct the program to return a certain number of sentences.

-To avoid repetitive results in repeat uses, the program should include an element of randomness in selecting sentences for presentation to the user. Ideally, the user should be able to set the randomness level. A randomness level of 0, for example, could remove all randomness and simply return the highest-scoring sentences, while a randomness level of 100 could return a completely random set of sentences.
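The scoring, diversity-penalty, and randomness ideas sketched above can be prototyped in a few lines of Python. This is a minimal sketch under my own assumptions, not a specification: the function names, the penalty values, and the naive whitespace tokenization (which would not work for Chinese or Japanese) are all illustrative.

```python
import random

def score_sentence(sentence_words, freq_scores, used_words,
                   obscure_penalty=-50, repeat_penalty=5):
    """Score a tokenized sentence; higher = more useful to a learner.

    freq_scores: dict mapping word -> frequency score (higher = more common).
    used_words: words already seen in previously selected sentences.
    Penalty values are arbitrary starting points to experiment with.
    """
    score = 0
    for word in sentence_words:
        if word in freq_scores:
            score += freq_scores[word]
        else:
            score += obscure_penalty  # word not in the top-N list at all
        if word in used_words:
            score -= repeat_penalty   # discourage repetition, don't forbid it
    return score

def select_sentences(sentences, freq_scores, n=10, randomness=0):
    """Return n sentences. randomness=0 always takes the best-scoring
    sentence; randomness=100 picks completely at random; values in
    between blend the two."""
    used = set()
    selected = []
    pool = list(sentences)
    for _ in range(min(n, len(pool))):
        scored = [(score_sentence(s.lower().split(), freq_scores, used), s)
                  for s in pool]
        scored.sort(reverse=True)
        if random.random() < randomness / 100:
            _, best = random.choice(scored)
        else:
            _, best = scored[0]
        selected.append(best)
        pool.remove(best)
        used.update(best.lower().split())
    return selected
```

Because already-used words are only penalized rather than disqualified, a sentence full of fresh high-frequency words still beats a repetitive one, which is the flexible behavior described above.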

Technical necessities:

-The program needs to be able to handle non-English characters and words such as French/Spanish/Portuguese accented characters as well as Japanese, Chinese, Korean, and Thai. For the Asian languages, the program will also need to be able to cope with sentences that do not separate words with spaces.

-The program must be able to handle sentence data formatted in varying ways from diverse sources. For example, it should use regular expressions or other tools to automatically identify the delimiters that signal where each sentence begins and ends. Whether the program is iterating over a text file exported from Anki or a messy webpage, it should return only complete single sentences: not isolated words, multi-sentence chunks, translations, HTML tags, or other noise.
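The delimiter-detection requirement above could start from something as simple as a regular expression that treats sentence-final punctuation (including the CJK full stop) as the boundary. This is a deliberately minimal illustration; real Anki exports and scraped pages would need more robust cleanup, and the length filter is my own assumption.

```python
import re

# Split on sentence-final punctuation, including CJK full stops and
# question/exclamation marks, keeping the delimiter with the sentence.
SENTENCE_RE = re.compile(r'[^.!?。！？]+[.!?。！？]')

def extract_sentences(text):
    """Pull individual sentences out of messy text: strip HTML tags,
    then drop fragments too short to be real sentences, while keeping
    CJK text (which has no spaces between words)."""
    text = re.sub(r'<[^>]+>', ' ', text)
    candidates = SENTENCE_RE.findall(text)
    return [c.strip() for c in candidates
            if len(c.split()) > 1 or any(ord(ch) > 0x3000 for ch in c)]
```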

Further ideas:

-I would like the scoring mechanism to be flexible so I can experiment with different ways of scoring the sentences.

-It would be a bonus if the program could somehow be used to identify sentences that use a particular grammatical pattern or construction.

-It would further be helpful if the search mechanism accepted close matches to words on the frequency list to account for different verb conjugations, spelling variations, etc.

-I would like it if the program could, at the user’s option, show its work by: (1) showing or hiding each sentence’s frequency score in its results, and (2) returning not only sentences, but also–for each sentence–the high-frequency words the program found in that sentence, and those words’ frequency score.
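The close-match idea above could be prototyped with Python’s standard difflib, which does fuzzy string matching out of the box. The 0.8 cutoff is an assumption to tune per language; a real implementation would likely want proper stemming or lemmatization instead.

```python
from difflib import get_close_matches

def frequency_score(word, freq_scores, cutoff=0.8):
    """Look up a word's frequency score, falling back to the closest
    spelling on the list (e.g. 'hablas' matching 'hablar') to account
    for conjugations and spelling variants. The cutoff is an arbitrary
    starting point, not a tested value."""
    if word in freq_scores:
        return freq_scores[word]
    matches = get_close_matches(word, freq_scores.keys(), n=1, cutoff=cutoff)
    return freq_scores[matches[0]] if matches else None
```

Returning None for a word with no close match is what lets the caller apply the obscure-word penalty described earlier.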

Please email me if you have suggestions or would like to help out with creating the program.

Learning blackjack, part 2: strategy

I wrote a complete version of this post a couple of days ago on a plane, but somehow lost the file. But then, I am writing this in part to fix it in my memory and build my understanding, so there’s no harm in writing it again.

Below is what I have gleaned about basic blackjack strategy, primarily from two sources: the iPhone app Blackjack 101 Free and the book The Most Powerful Blackjack Manual by Jay Moore. Credit for any insights goes to those sources, and blame for any mistakes rests with me. I am a novice still trying to learn the game, so I would caution against taking the advice below without a healthy dose of skepticism.

Key and glossary (basic rules were outlined in the previous post): A = Ace (value of 1 or 11); K = King (value of 10); Q = Queen (value of 10); J = Jack (value of 10);

“soft” hand = hand that includes an ace, in which the ace can count as either 1 or 11 without the hand going over 21; “hard” hand = hand that does not include an ace, or hand in which the value of the ace must equal 1 so as to avoid exceeding 21;

“blackjack” = a winning two-card hand that totals exactly 21 (an ace plus a ten-value card, e.g. 10 + A);

“to bust” = to have cards whose total value exceeds 21, thereby losing the round; “busting” or “breaking” hand = a hand that has not yet busted but that presents a heightened risk of busting.

Basic principles:
-I always thought that the object of blackjack is to get as close to 21 as possible without busting. In fact, however, the object is to beat the dealer, which may involve standing on far less than 21 if the dealer’s upcard suggests that the dealer may bust. The player acts before the dealer, and if the player busts, she loses regardless of what happens to the dealer.
-There are thus two variables in every blackjack hand that determine the appropriate move: (1) the strength of one’s own hand; and (2) the apparent strength of the dealer’s hand, as suggested by the dealer’s upcard.
-The average winning total in a blackjack hand is 19. But if one has 17, any additional cards are likely to cause the hand to bust. Therefore, regardless of what the dealer’s upcard is, the player should always stand on “hard 17.”
-The strongest hand for both the dealer and the player is a “soft” hand: a hand that includes an ace.
-The dealer makes no decisions; she simply follows a script. She must always hit on a total of hard 16 or below and stand on a total of hard 17 or higher; and depending on the table’s rules, she must either hit or stand on a soft 17 (Ace + 6).
-The core strategy: bet aggressively but play conservatively when the dealer’s hand (upcard) is weak to avoid busting and allow the dealer to bust; bet conservatively but play aggressively when the dealer’s hand (upcard) is strong to try to reach a higher total than the dealer without busting.
-The strongest card is an ace, as mentioned above; other strong upcards for the dealer are cards with a value of 10 (10/J/Q/K) or 9. An upcard of 8 is neutral, leaning toward strong; an upcard of 7 is neutral, leaning toward weak; an upcard of 2 or 3 is weak; and an upcard of 4, 5, or 6 is the weakest. Thus, for instance, regardless of what the player is holding, the player should bet aggressively and play conservatively–hoping to avoid busting and wait for the dealer to bust–when the dealer shows a “busting” upcard, i.e. a 4, 5, or 6.
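For readers who think in code, the upcard ranking above can be written as a tiny lookup. The category names are just this post’s summary labels, not terms from any official strategy chart.

```python
def dealer_upcard_strength(upcard):
    """Classify the dealer's upcard per the ranking above.
    'A' is the strongest; 4-6 are the 'busting' upcards against
    which the player bets aggressively but plays conservatively."""
    if upcard == 'A':
        return 'strongest'
    if upcard in ('10', 'J', 'Q', 'K', '9'):
        return 'strong'
    if upcard == '8':
        return 'neutral-strong'
    if upcard == '7':
        return 'neutral-weak'
    if upcard in ('2', '3'):
        return 'weak'
    if upcard in ('4', '5', '6'):
        return 'weakest'  # the "busting" upcards
    raise ValueError(f'unknown upcard: {upcard}')
```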

Learning blackjack

In two weeks, I am going to Las Vegas for my brother’s bachelor party. It is inevitable that I will gamble a bit and lose whatever I gamble; I consider this an entertainment expense, not an opportunity to make money. Nevertheless, it seems wise to attempt to limit the damage—or at least acquire some knowledge as a consolation prize—by becoming well-informed about precisely how the casino will be taking my money.

Additionally, for years I have wanted to learn statistics. In second grade, I kept “stat books” in which I attempted to study the relative strengths and weaknesses of football and hockey teams and make predictions about who would defeat whom. In eleventh grade, however, my math education derailed, and I never got around to learning statistics. When I have tried to do so, I have been put off by the giant equations with their symbols and subscripts, and dismissed the topic as something I am fated to never understand. 

So, motivated by these two complementary aims, I’ve decided to learn a casino game, and (I think) the one with the best odds is Blackjack. Accordingly, I’ve done some reading on Blackjack strategy over the past two days, and I’ll post what I have learned. Credit for any correct assertions goes to Jay Moore’s “The Most Powerful Blackjack Manual” and the iPhone app “BlackJack 101 Free.” It seems there is some disagreement about strategy at the margins—which is exactly what makes a game fun, anyway—so do not take what follows as The Truth. I am a confessed novice: I know nothing about this subject except, perhaps, for what follows.

By way of introduction: the object of blackjack is for the player to defeat the dealer; the player and the dealer each get two cards; all face cards (J, Q, K) are worth 10; each side tries to draw cards totaling as close as possible to 21 without going over; and if either the dealer or the player goes over 21, that is called a “bust.” The dealer deals the player’s cards face-up, but the dealer himself keeps one card face up (the “upcard”) and one card face down (the “hole card”). To “hit” is to request another card (at the risk of “busting,” i.e. going over 21); to “stand” is to refuse additional cards (at the risk of having a lower total than the dealer); to “split,” if one is dealt a pair (e.g. two fives or two aces), is to double one’s bet and play two separate hands simultaneously; and to “double down” is to double one’s initial bet and accept only one additional card. The dealer must follow a particular script, usually “hitting” if his cards total 16 or fewer points but “standing” if his cards total more than 16.

Strategy to come in the next post.


My homepage is due for an update. It has remained essentially unchanged since I first built it in 2009. Similarly, the two blogs I kept on and off from about 2003 until about 2011 can safely be declared defunct. It is time for a fresh start and a new website that better reflects my current interests and is compatible with my new career as an attorney. With respect to the latter: none of my posts on this blog will deal with law or politics.

Rethinking my website entailed several decisions. My first decision was to switch from a static, portfolio-style homepage to a blog. You can still find a list of my publications by following the link at the top of this page. But rather than continuing to feature older publications reflecting bygone interests on the front page of the site, I decided that the flexibility of a blog would allow me to update more easily and share more content.

I then had to decide what the theme of the blog would be. This is the difficult part, because I am curious about everything; my interests are diverse and frequently change or reorder themselves. I read and write about many different topics, and I tend toward breadth rather than depth of exploration and knowledge. Any proficiency I develop on a topic is often the result of sporadic, short bursts of intense study. (The exceptions to this rule are my core professional interests of writing and law–for these, I aim for depth and mastery.)

So the trouble with selecting a narrow theme is that I would lose interest in that theme, almost inevitably. Instead, therefore, curiosity and self-education will be my theme: I’ll write about what I’m reading, thinking about, studying, and discovering in my free time, whether that is music theory, computer programming, foreign languages, or any number of other topics. In the end, I am writing to learn and help myself remember what I learn. As a byproduct of that, I hope to write some posts that are helpful and interesting to others.