JAVASCRIPT 2 – Strings and text

This tutorial outlines how we can use node.js to analyze and explore text. It is an excerpt from Dan Shiffman’s excellent tutorial here.

These commands can be typed directly into the node environment on your command line (remember how you get there? Open your terminal or Git Bash and type node at the prompt). To exit node, press CONTROL-C twice.
They can also be saved in text files and run from the command line; a link to the repository is at the bottom.

The code examples in this tutorial can be found in this download. It refers to the examples in the folders – 02_Strings and 03_fileinput_node.


In all of our JavaScript programs, we’ll be using String objects to store textual information. You may be familiar with Strings in Processing or have poked around the Javadoc reference for Strings. Strings in JavaScript have a lot of the same functionality that they do in Java, and we’ll start by looking at some of the basic methods for manipulating Strings in JS.

A String, at its core, is really just a fancy way of storing an array of characters. Without the String object, we might find ourselves writing code like:

var sometext = ['h','e','l','l','o'];

Interestingly enough, there is no distinction between an individual character and a String in JS. Both of the variables below store the same datatype.

var a = 'a';
var h = 'hello';

In JavaScript, Strings can be literal primitives or objects.

var s1 = 'hello';               // a primitive
var s2 = new String('hello');   // an object

For the most part, this is a distinction we don’t have to worry about. JS will automatically convert our primitive String into an object when necessary. In general, it’s good practice to initialize your Strings as primitives to increase performance.
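To make the primitive/object distinction concrete, here is a small sketch that inspects both kinds of String with typeof. The variable names simply reuse s1 and s2 from above; nothing here is required for the rest of the tutorial.

```javascript
var s1 = 'hello';               // a primitive
var s2 = new String('hello');   // an object wrapping the same text

console.log(typeof s1);         // 'string'
console.log(typeof s2);         // 'object'

// The primitive is wrapped automatically, so properties and methods still work on it.
console.log(s1.length);         // 5
console.log(s1.toUpperCase());  // 'HELLO'

// Strict equality also compares types, so the primitive and the object differ...
console.log(s1 === s2);           // false
// ...but the value stored inside the object matches the primitive.
console.log(s1 === s2.valueOf()); // true
```

The last two lines are one practical reason to prefer primitives: comparisons behave the way you would expect.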


JavaScript provides us with a basic set of String functions that allow for simple manipulation and analysis. Next week, we’ll also look at how regular expressions allow us to perform advanced String processing, but we’ll start this week with non-regex String methods and gather some skills doing all of our text processing manually, character by character. All of the available String properties and functions are laid out in the JavaScript reference, and we’ll explore a few useful ones here. Let’s take a closer look at three: indexOf(), substring(), and the length property.

indexOf() locates a sequence of characters within a string. For example, run this code and examine the result:

var sentence = 'The quick brown fox jumps over the lazy dog.';
console.log(sentence.indexOf('blah blah'));

Note that indexOf() returns the position where the match begins, counting from 0 for the first character, and returns -1 if the search phrase is not part of the String.
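A few more searches on the same sentence may help show how the returned positions work. This is only an illustrative sketch; the search phrases are arbitrary choices.

```javascript
var sentence = 'The quick brown fox jumps over the lazy dog.';

console.log(sentence.indexOf('The'));       // 0  (match starts at the first character)
console.log(sentence.indexOf('quick'));     // 4
console.log(sentence.indexOf('fo'));        // 16 (partial words can match too)
console.log(sentence.indexOf('the'));       // 31 (the search is case-sensitive)
console.log(sentence.indexOf('blah blah')); // -1 (not found)
```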

After we find a certain search phrase within a String, we might want to pull out part of the String and save it in a different variable. This is what we call a “substring”, and we can use JavaScript’s substring() function to take care of this task. Examine and run the following code:

var sentence = 'The quick brown fox jumps over the lazy dog.';
var phrase = sentence.substring(4,9);
console.log(phrase);   // quick

Note that the substring begins at the specified beginIndex (the first argument) and extends to the character at endIndex (the second argument) minus one. Thus the length of the substring is endIndex minus beginIndex.
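The two functions are often used together: indexOf() finds where a phrase starts, and substring() pulls it out. The sketch below assumes a search word of 'brown'; the variable names are made up for illustration.

```javascript
var sentence = 'The quick brown fox jumps over the lazy dog.';
var search = 'brown';

var start = sentence.indexOf(search);           // 10
if (start !== -1) {
  var end = start + search.length;              // 15, one past the last matched character
  var found = sentence.substring(start, end);   // characters 10 through 14
  var rest = sentence.substring(end);           // one argument: everything from end onward

  console.log(found);   // 'brown'
  console.log(rest);    // ' fox jumps over the lazy dog.'
}
```

Checking for -1 first matters, because a failed search would otherwise feed a negative index into substring().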

At any given point, we might also want to access the length of the String. We can accomplish this with the length property.

var sentence = 'The quick brown fox jumps over the lazy dog.';
console.log(sentence.length);   // 44

It's also important to note that we can concatenate (i.e. join) Strings together using the + operator. With numbers, plus means add; with Strings (or characters), it means concatenate:

var num = 5 + 6;                        // ADDING TWO NUMBERS!
var phrase = 'To be' + ' or not to be'; // JOINING TWO STRINGS!
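Things get more interesting when numbers and Strings are mixed in one expression. As a brief sketch (the variable names are arbitrary), + works left to right and switches to concatenation as soon as one side is a String:

```javascript
var a = 5 + 6;               // 11, two numbers are added
var b = '5' + 6;             // '56', the number is converted to a String and joined
var c = 5 + 6 + ' apples';   // '11 apples', the numbers are added first
var d = 'apples: ' + 5 + 6;  // 'apples: 56', everything after the String is joined
var e = Number('5') + 6;     // 11, converting the String first restores addition

console.log(a, b, c, d, e);
```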


One String-related function that will prove very useful in our text analysis programs is split(). split() separates a group of strings embedded into a longer string into an array of strings.

Examine the following code:

var spaceswords = 'The quick brown fox jumps over the lazy dog.';
var list1 = spaceswords.split(' ');
var commaswords = 'The,quick,brown,fox,jumps,over,the,lazy,dog.';
var list2 = commaswords.split(',');
for (var i = 0; i < list2.length; i++) {
  console.log(i + ': ' + list2[i]);
}

// Calculate sum of a list of numbers in a string
var numbers = '8,67,5,309';
var numlist = numbers.split(',');
var sum = 0;
for (var i = 0; i < numlist.length; i++) {
  sum = sum + Number(numlist[i]);  // Converting each String into a number!
}

To perform the reverse of split, we can write a quick function that joins together an array of Strings.

var words = ['it','was','a','dark','and','stormy','night'];

Knowing about loops and arrays we could join the above array of strings together as follows:

// Concatenating an array of Strings manually
function join(str, separator) {
  var stuff = '';
  for (var i = 0; i < str.length; i++) {
    // Add the separator before every element except the first
    if (i != 0) stuff += separator;
    stuff += str[i];
  }
  return stuff;
}
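Writing the loop by hand is a useful exercise, but JavaScript arrays also come with a built-in join() method that does the same job. A short sketch using the same array:

```javascript
var words = ['it','was','a','dark','and','stormy','night'];

// join() takes the separator to place between elements
var sentence = words.join(' ');
console.log(sentence);          // 'it was a dark and stormy night'
console.log(words.join(','));   // 'it,was,a,dark,and,stormy,night'

// split() and join() undo each other when given the same separator
console.log(sentence.split(' ').length);   // 7
```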


This explanation pertains to the 03_fileinput_node/fileio.js example in the zip folder.

To start, we are going to be working in the simple world of text in and text out. We are going to do this a few ways, beginning with a simple node.js server-side program that processes a text file.

Let’s start with a simple node.js program. To load from a file, we’ll use the file system module.

var fs = require('fs');

One thing that is nice about working with the command line is that we can pass in filenames as arguments to a program. For example, let’s say we have a JS file process.js. From the command line, we’ll say:

% node process.js myfile.txt

The program will then read this text file as input. To accomplish this, we can grab the filename as the third element (index 2) of the process.argv arguments array.

var filename = process.argv[2];

We can even check to make sure the user entered a file name.

if (process.argv.length < 3) {
  console.log('Oops, you forgot to pass in a text file.');
  process.exit(1);   // stop here, since there is nothing to read
}
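The reason the filename sits at index 2 is that node fills the first two slots of process.argv itself: index 0 holds the path to node and index 1 the path to the script. The sketch below wraps the check in a helper; getFilename is a hypothetical name introduced here for illustration, not part of the example code.

```javascript
// Hypothetical helper: returns the first user-supplied argument, or null if none was given.
function getFilename(argv) {
  // argv[0] is the path to node, argv[1] is the path to the script,
  // so the first thing the user typed after the script name is at index 2.
  if (argv.length < 3) {
    return null;
  }
  return argv[2];
}

var filename = getFilename(process.argv);
if (filename === null) {
  console.log('Oops, you forgot to pass in a text file.');
} else {
  console.log('Reading: ' + filename);
}
```

Passing the array in as a parameter also makes the check easy to try out with made-up arguments, for example getFilename(['node', 'process.js', 'myfile.txt']).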

We’ll use the readFile() method to read the file. readFile() takes three arguments: the name of the file, the encoding of the file, and a function that will be executed when the data from the file is ready (known as a callback).

fs.readFile(filename, 'utf8', analyze);

The use of a callback is very typical of JavaScript, and we’ll be seeing many examples of this over the course of the semester. It’s also possible to write an “anonymous” function directly as an argument to readFile() but this will make the code a bit harder to follow. Let’s take a look at the analyze() function.

function analyze(err, data) {
  if (err) {
    throw err;
  }
  console.log('OK: ' + filename);
}

The function takes two arguments: err and data. err will be null (unless there’s an error) and data will contain all of the text from the file in a String (unless there was an error). If you’re not familiar with throw err, take a look at the documentation for throw and try/catch. If this looks like gobbledygook to you, don’t worry, this is just a way to get our program to show us errors when they are encountered.
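For anyone who wants to see throw and try/catch in isolation, here is a minimal sketch that is separate from the file-reading code. firstWord is a hypothetical helper made up for this illustration.

```javascript
// Hypothetical helper that refuses to work on missing or empty text.
function firstWord(text) {
  if (typeof text !== 'string' || text.length === 0) {
    throw new Error('No text to analyze.');
  }
  return text.split(' ')[0];
}

var message;
try {
  firstWord('');             // throws, so the next line never runs
  message = 'no error';
} catch (err) {
  // Execution jumps here, and err is the object that was thrown
  message = 'Caught: ' + err.message;
}
console.log(message);                          // 'Caught: No text to analyze.'
console.log(firstWord('hello there world'));   // 'hello'
```

When an error is thrown and nothing catches it, as in analyze(), node prints the error and stops the program.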

Once we’ve gotten the hang of reading and writing files, we can start to think about ways of creating output text based on an input text. For example, we could do something as simple as make a new text with every other word from a source text. To do this, we can split the text up into an array of Strings (with space as a delimiter) and create a new String by appending every other word to it. JavaScript has no StringBuffer like Java’s; if we are dealing with really long texts, we can instead collect the words in an array and join them at the end.

We do this by adding the following to the analyze() function and then passing the string everyotherword, instead of the raw data, to fs.writeFile('output.txt', everyotherword, output), where the third argument, output, is a callback that runs once the file has been written.

// Split text by wherever there is a space
var words = data.split(" ");
var everyotherword = '';
// Step through the array two words at a time
for (var i = 0; i < words.length; i+=2) {
  var word = words[i];
  everyotherword += word + ' ';
}

Using the Nigerian Spam as a source text, the result is something like:

On 12th, a contractor the co-orporation, Kingdom Olaf made time
Deposit twelve months, at US$ (Seventeen Three Hundred fifty
Thousand only) my maturity,I a notification his address but no
After month, sent reminder finally from contract the Pertroleum
co-orporation Mr.Olaf died an accident further found that died
making WILL,and attempts his of was therefore further and
that Olaf
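The same idea can be packaged as a function that collects the kept words in an array and joins them at the end, which also avoids the trailing space. This is a sketch under that assumption; everyOtherWord is a hypothetical name and the test sentence is made up.

```javascript
// Hypothetical helper: keep words 0, 2, 4, ... of a space-separated text.
function everyOtherWord(text) {
  var words = text.split(' ');
  var kept = [];
  for (var i = 0; i < words.length; i += 2) {
    kept.push(words[i]);
  }
  // join() puts a space between words but not after the last one
  return kept.join(' ');
}

console.log(everyOtherWord('it was a dark and stormy night'));
// 'it a and night'
```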

Another thing we might try is to search for every time a certain word appears. The following code examines a text for every time the word “God” appears and keeps the word “God” along with what follows it:

var words = data.split(' ');
// Stop one word early so that words[i+1] always exists
for (var i = 0; i < words.length-1; i++) {
  if (words[i] == 'God') {
    console.log(words[i] + ' '  + words[i+1]);
  }
}

The result applied to Genesis from the Bible looks something like:

God Almighty
God forbid
God hath
God did
God hath
God of
God Almighty
God make
God of
God of
God meant
God will
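Since several pairs repeat in a list like this, a natural next step is to count how often each following word occurs. The sketch below is one possible way to do that; wordsAfter is a hypothetical helper and the sample sentence is made up rather than quoted from any source text.

```javascript
// Hypothetical helper: count the words that directly follow a target word.
function wordsAfter(text, target) {
  var words = text.split(' ');
  // An object with no prototype, so words like 'constructor' are safe to use as keys
  var counts = Object.create(null);
  for (var i = 0; i < words.length - 1; i++) {
    if (words[i] === target) {
      var next = words[i + 1];
      counts[next] = (counts[next] || 0) + 1;
    }
  }
  return counts;
}

// Made-up sample text, for illustration only
var sample = 'God of light and God of rain and God Almighty';
var counts = wordsAfter(sample, 'God');
console.log('of: ' + counts['of']);               // 'of: 2'
console.log('Almighty: ' + counts['Almighty']);   // 'Almighty: 1'
```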

We could also reverse all the characters in a text by walking through the String backwards. Note how the for loop starts at the end of the String (data.length - 1). Also notice the very useful function charAt(), which returns the character at index i of the String.

// Reverse all the characters in the text
var output = '';
for (var i = data.length-1; i >= 0; i--) {
  output += data.charAt(i);
}

The result applied to the Nigerian Spam looks something like:

rof %5 dna uoy rof %53 dna em rof %06 fo oitar eht ni erahs ot su
rof tnuocca ruoy otni diap eb lliw yenom ehT .refsnart eht rof rovaf
ruoy ni noitartsinimda/etaborp fo rettel dna stnemucod yrassecen eht
niatbo ot dna LLIW eht fo noitaziraton dna gnitfard rof yenrotta na
fo secivres eht yolpme llahs eW .nik fo txen eht sa ecalp ni uoy tup
lliw taht stivadiffa dna stnemucod yrassecen eht eraperp lliw yenrotta
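For comparison, here is a sketch of the backwards loop next to a shorter version built from split(), reverse(), and join(). Both function names are hypothetical, chosen just for this example.

```javascript
// Manual version: walk the String from its last character to its first.
function reverseLoop(text) {
  var output = '';
  for (var i = text.length - 1; i >= 0; i--) {
    output += text.charAt(i);
  }
  return output;
}

// Built-in version: split into single characters, reverse the array, join it back up.
function reverseBuiltIn(text) {
  return text.split('').reverse().join('');
}

console.log(reverseLoop('hello'));      // 'olleh'
console.log(reverseBuiltIn('hello'));   // 'olleh'
```

The manual loop is the better exercise; the built-in chain is what you will more often see in practice.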


We’ll end this week by looking at a basic example of text analysis. We will read in a file, examine some of its statistical properties, and write a report. Our example will compute the Flesch Index (aka Flesch-Kincaid Reading Ease test), a numeric score that indicates the readability of a text. The lower the score, the more difficult the text. The higher, the easier. For example, texts with a score of 90-100 are, say, around the 5th grade level, whereas 0-30 would be for “college graduates”. The results of the test on a few sample texts (the Bible, spam, a New York Times article, and Processing tutorials I’m writing) are displayed to the right.

The Flesch Index is computed as a function of total words, total sentences, and total syllables. It was developed by Dr. Rudolf Flesch and modified by J. P. Kincaid (thus the joint name). Most word processing programs will compute the Flesch Index for you, which provides us with a nice method to check our results.

Flesch Index = 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)
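To get a feel for the numbers, here is the formula as a small function applied to made-up counts. Both ratio terms are subtracted from 206.835, matching the calculation code later in this section. fleschIndex is a hypothetical name, and the counts are invented for illustration, not measured from a real text.

```javascript
// Hypothetical helper: apply the Flesch formula to three counts.
function fleschIndex(totalWords, totalSentences, totalSyllables) {
  return 206.835
    - 1.015 * (totalWords / totalSentences)   // longer sentences lower the score
    - 84.6 * (totalSyllables / totalWords);   // longer words lower the score
}

// Invented counts: 100 words, 5 sentences, 150 syllables
// 206.835 - 1.015 * 20 - 84.6 * 1.5 = 206.835 - 20.3 - 126.9
var score = fleschIndex(100, 5, 150);
console.log(score.toFixed(3));   // '59.635'
```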

Our pseudo-code will look something like this:

Read input file into String object
Count words
Count syllables
Count sentences
Apply formula
Write out report file

We know we can read in text from a file and store it in a String object as demonstrated in the example above. Now, all we have to do is examine that String object, counting the total words, sentences, and syllables, applying the formula as a final step. To count words, we’ll use split().

The first thing we’ll do is count the number of words in the text. We’ve seen in some of the examples above that we can accomplish this by using split() to split a String up into an array wherever there is a space. For this example, however, we are going to want to split by more than a space. A new word occurs whenever there is a space or some sort of punctuation.

var delimiters = /[.:;?! !@#$%^&*()]+/;
var words = data.split(delimiters);
var totalWords = words.length;

You’ll notice some new syntax here. /[.:;?! !@#$%^&*()]+/ is a regular expression. We are going to cover regex in detail next week. For now, we should just understand this as something that indicates a list of possible delimiters (any character that appears between /[ and ]+/).
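One detail worth seeing in action: when the text ends with a delimiter, split() returns an extra empty String at the end of the array, which inflates the word count by one. The sketch below uses a made-up sentence and shows one possible way to drop empty tokens with filter(); this is an optional refinement, not part of the example code.

```javascript
var delimiters = /[.:;?! !@#$%^&*()]+/;
var text = 'Hello world! How are you?';   // made-up sample text

var tokens = text.split(delimiters);
console.log(tokens.length);   // 6: 'Hello', 'world', 'How', 'are', 'you', and ''

// Keep only tokens that contain at least one character
var words = tokens.filter(function (token) {
  return token.length > 0;
});
console.log(words.length);    // 5
```

Note also that commas and line breaks are not in this delimiter list, so a word followed by a comma keeps the comma attached.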

Now that we have split up the text, we can march through all the words (tokens) and count their syllables.

var totalSyllables = 0;
for (var i = 0; i < words.length; i++) {
  var word = words[i];
  totalSyllables += countSyllables(word);
}

Ok, so countSyllables() isn’t a function that exists in JavaScript. We’re going to have to write it ourselves. The following method is not the most accurate way to count syllables, but it will do for now.

Syllables = total # of vowels in a word (not counting vowels that appear after another vowel and when ‘e’ is found at the end of the word)

“beach” -> one syllable
“banana” -> three syllables
“home” -> one syllable

Our code looks like this:

// A method to count the number of syllables in a word
// Pretty basic, just based off of the number of vowels
// This could be improved
function countSyllables(word) {
  var syl    = 0;
  var vowel  = false;
  var length = word.length;

  // Check each word for vowels (don't count more than one vowel in a row)
  for (var i = 0; i < length; i++) {
    if (isVowel(word.charAt(i)) && (vowel == false)) {
      // First vowel of a group: count a syllable
      vowel = true;
      syl++;
    } else if (isVowel(word.charAt(i)) && (vowel == true)) {
      // Still inside the same group of vowels: don't count again
      vowel = true;
    } else {
      vowel = false;
    }
  }

  var tempChar = word.charAt(word.length-1);
  // Check for 'e' at the end, as long as not a word w/ one syllable
  if (((tempChar == 'e') || (tempChar == 'E')) && (syl != 1)) {
    syl--;
  }
  return syl;
}

// Check if a char is a vowel (count y)
function isVowel(c) {
  if      ((c == 'a') || (c == 'A')) { return true;  }
  else if ((c == 'e') || (c == 'E')) { return true;  }
  else if ((c == 'i') || (c == 'I')) { return true;  }
  else if ((c == 'o') || (c == 'O')) { return true;  }
  else if ((c == 'u') || (c == 'U')) { return true;  }
  else if ((c == 'y') || (c == 'Y')) { return true;  }
  else                               { return false; }
}

As we will see next week, the above could be vastly improved using Regular Expressions, but it’s nice as an exercise to learn how to do all the String manipulation manually before we move on to more advanced techniques.
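As a small preview, here is a sketch of how the same counting rule might look with a regular expression. countSyllablesRegex is a hypothetical name, and this is only one possible way to write it: the pattern matches each run of consecutive vowels (counting y) as a single group.

```javascript
// Hypothetical regex version of the same rule: one syllable per group of vowels,
// minus one for a trailing 'e' unless the word has only one syllable.
function countSyllablesRegex(word) {
  // match() returns an array of vowel groups, or null when there are none
  var groups = word.match(/[aeiouy]+/gi);
  var syl = groups ? groups.length : 0;
  if (/e$/i.test(word) && syl !== 1) {
    syl--;
  }
  return syl;
}

console.log(countSyllablesRegex('beach'));    // 1
console.log(countSyllablesRegex('banana'));   // 3
console.log(countSyllablesRegex('home'));     // 1
```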

Counting sentences is a bit simpler. We’ll just split the content using periods, question marks, exclamation points, etc. (“.:;?!”) as delimiters and count the total number of elements in the resulting array. This isn’t terribly accurate; for example, a single sentence that contains an e-mail address will be counted as several sentences, because the address itself contains periods. Nevertheless, as a first pass, this will do.

// Look for sentence delimiters
var sentenceDelim = /[.:;?!]/;
var sentences = data.split(sentenceDelim);
totalSentences = sentences.length;
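To see how this count behaves, here is a sketch on a made-up three-sentence text. Because the text ends with a delimiter, split() leaves one empty piece at the end; filtering out blank pieces is one optional way to tighten the count.

```javascript
var sentenceDelim = /[.:;?!]/;
var text = 'It rained. We stayed in! Why not?';   // made-up sample text

var pieces = text.split(sentenceDelim);
console.log(pieces.length);      // 4, because of the empty piece after the final '?'

// Drop pieces that are empty or contain only whitespace
var sentences = pieces.filter(function (piece) {
  return piece.trim().length > 0;
});
console.log(sentences.length);   // 3
```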

Now, all we need to do is apply the formula and generate a report.

// Calculate flesch index
var f1 = 206.835;
var f2 = 84.6;
var f3 = 1.015;
var r1 = totalSyllables / totalWords;   // average syllables per word
var r2 = totalWords / totalSentences;   // average words per sentence
var flesch = f1 - (f2 * r1) - (f3 * r2);

// Write Report
var report = "";
report += "Total Syllables: " + totalSyllables + "\n";
report += "Total Words    : " + totalWords + "\n";
report += "Total Sentences: " + totalSentences + "\n";
report += "Flesch Index   : " + flesch + "\n";

The full example code is here.