Source Model
In this assignment you’ll practice
You’re interested in natural language processing …
Write a class named SourceModel that reads a file containing a training corpus and builds a first-order Markov chain of the transition probabilities between letters in the corpus. Only alphabetic characters in the corpus should be considered, and they should be normalized to upper or lower case. For simplicity (see background), only consider the 26 letters of the English alphabet.
Here are some example corpus files and test files:
You can assume corpus files are of the form <source-name>.corpus.
Note: this section is here for those interested in the background, but you don’t need to know it to complete this assignment. If you just want to get it done, feel free to skip to the Specific Requirements section.
In machine learning we train a model on some data and then use that model to make predictions about unseen instances of the same kind of data. For example, we can train a machine learning model on a data set consisting of several labeled images, some of which depict a dog and some of which don’t. We can then use the trained model to predict whether some unseen image (an image not in the training set) has a dog. The better the model, the better the accuracy (percentage of correct predictions) on unseen data.
We can create a model of language and use that model to predict the likelihood that some unseen text was generated by that model, in other words, how likely the unseen text is an example of the language modeled by the model. The model could be of a particular author, or a language such as French or English. One simple kind of model is well-suited to this problem: Markov chains.
Markov chains are useful for time-series or other sequential data. A Markov chain is a finite-state model in which the current state is dependent only on a bounded history of previous states. In a first-order Markov chain the current state is dependent on only one previous state.
One can construct a simple first-order Markov chain of a language as the transition probabilities between letters in the language’s alphabet. For example, given this training corpus of a language:
BIG A, little a, what begins with A?
Aunt Annie’s alligator.
A...a...A
BIG B, little b, what begins with B?
Barber, baby, bubbles and a bumblebee.
We would have the model:
We’ve only shown the letter-to-letter transitions that occur in the training corpus. Notice:
This model indicates that whenever the letter a occurs in our training corpus, the next letter is a, b, l, n, r, t, u or w. The arrow from a to b is labeled .19 because b appears after a 3 out of 16 times, or approximately 19 percent of the time. A first-order Markov chain is a kind of bigram model. Here are all the bigrams in the training text that begin with a, that is, all the state transitions from a:
(a, l), (a, w), (a, t), (a, a), (a, u), (a, n), (a, l), (a, t),
(a, a), (a, a), (a, b), (a, t), (a, r), (a, b), (a, n), (a, b)
A Markov chain represents all the bigrams and their probabilities of occurrence in the training corpus.
A Markov chain can be represented as a transition matrix in which the probability of state j after state i is found in element (i, j) of the matrix. In the example below we have labeled the rows and columns with letters for readability. The probability of seeing the letter n after the letter a in the training corpus is found by entering row a and scanning across to column n, where we find the probability .12.
  | a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z |
a | 0.19 | 0.19 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.12 | 0.01 | 0.12 | 0.01 | 0.01 | 0.01 | 0.06 | 0.01 | 0.19 | 0.06 | 0.01 | 0.06 | 0.01 | 0.01 | 0.01 |
b | 0.12 | 0.12 | 0.01 | 0.01 | 0.24 | 0.01 | 0.01 | 0.01 | 0.12 | 0.01 | 0.01 | 0.18 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.12 | 0.01 | 0.06 | 0.01 | 0.06 | 0.01 |
c | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
d | 1.00 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
e | 0.11 | 0.22 | 0.01 | 0.01 | 0.11 | 0.01 | 0.22 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.11 | 0.22 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
f | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
g | 0.40 | 0.20 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.40 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
h | 0.75 | 0.25 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
i | 0.01 | 0.01 | 0.01 | 0.01 | 0.10 | 0.01 | 0.30 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.20 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.40 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
j | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
k | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
l | 0.01 | 0.01 | 0.01 | 0.01 | 0.50 | 0.01 | 0.01 | 0.01 | 0.38 | 0.01 | 0.01 | 0.12 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
m | 0.01 | 1.00 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
n | 0.01 | 0.01 | 0.01 | 0.17 | 0.01 | 0.01 | 0.01 | 0.01 | 0.17 | 0.01 | 0.01 | 0.01 | 0.01 | 0.17 | 0.01 | 0.01 | 0.01 | 0.01 | 0.33 | 0.17 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
o | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 1.00 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
p | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
q | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
r | 0.33 | 0.67 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
s | 0.50 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.50 | 0.01 | 0.01 | 0.01 |
t | 0.10 | 0.20 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.20 | 0.01 | 0.01 | 0.01 | 0.20 | 0.01 | 0.01 | 0.10 | 0.01 | 0.01 | 0.01 | 0.01 | 0.20 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
u | 0.01 | 0.33 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.33 | 0.33 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
v | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
w | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.50 | 0.50 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
x | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
y | 0.01 | 1.00 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
z | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
Given a Markov chain model of a source, we can compute the probability that the model would produce a given string of letters by applying the chain rule. Simply stated, we walk the transitions in the Markov chain and multiply the transition probabilities. For example, to compute the probability that “Big C, Little C” would be produced by our model, we look up the following probabilities in the transition matrix:
p(b, i) = .12
p(i, g) = .30
p(g, c) = .01
p(c, l) = .01
p(l, i) = .38
p(i, t) = .40
p(t, t) = .20
p(t, l) = .20
p(l, e) = .50
p(e, c) = .01
Multiplying them gives us 1.0588235294117648e-10. (This figure comes from the unrounded matrix entries, such as 2/17 for p(b, i); multiplying the two-decimal values shown above gives roughly 1.09e-10.) Notice that, in order to avoid getting zero-probability predictions using our simplified technique, we store .01 in our transition matrix for any bigram we don’t see in the training corpus. Also note that for larger test strings we would underflow the computer’s floating point representation and end up with zero probability. There are techniques for avoiding this problem discussed in the Additional Information listed below.
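As a rough illustration of the computation above, the following sketch multiplies the ten transition probabilities for “Big C, Little C” and also sums their logarithms, which is one common way to sidestep underflow on long strings. The class name ChainRuleDemo and its helper methods are hypothetical, and the probabilities are entered as the unrounded fractions that the rounded table entries appear to come from (for example 2/17 for b to i); treat those fractions as an assumption.

```java
final class ChainRuleDemo {
    /** Multiplies the transition probabilities directly (may underflow for long inputs). */
    static double product(double[] probs) {
        double p = 1.0;
        for (double x : probs) {
            p *= x;
        }
        return p;
    }

    /** Sums natural logarithms instead of multiplying, which avoids underflow. */
    static double logSum(double[] probs) {
        double sum = 0.0;
        for (double x : probs) {
            sum += Math.log(x);
        }
        return sum;
    }

    /** Transitions for "bigclittlec": bi, ig, gc, cl, li, it, tt, tl, le, ec. */
    static double[] bigCLittleC() {
        // Assumed unrounded values; unseen bigrams use the 0.01 floor.
        return new double[] {
            2.0 / 17, 3.0 / 10, 0.01, 0.01, 3.0 / 8,
            4.0 / 10, 2.0 / 10, 2.0 / 10, 4.0 / 8, 0.01
        };
    }

    public static void main(String[] args) {
        double[] probs = bigCLittleC();
        System.out.println("product   = " + product(probs));
        System.out.println("log(prob) = " + logSum(probs));
    }
}
```

Because the logarithm is monotonic, comparing log-probabilities across models ranks them the same way as comparing the raw products.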
We’ve greatly simplified the presentation here to focus on the programming. For more information consult the following references.
Write a class called SourceModel with the following constructors and methods:
A single constructor with two String parameters, where the first parameter is the name of the source model and the second is the file name of the corpus file for the model. The constructor should create a letter-letter transition matrix using this recommended algorithm sketch:
Initialize a 26x26 matrix for character counts
Print “Training {name} model ... ”
Read the corpus file one character at a time, converting all characters to lower case and ignoring any non-alphabetic characters.
For each character, increment the corresponding (row, col) in your counts matrix. The row is for the previous character, the col is for the current character. (You could also think of this in terms of bigrams.)
After you read the entire corpus file, you’ll have a matrix of counts.
From the matrix of counts, create a matrix of probabilities – each row of the transition matrix is a probability distribution.
Print “done.” followed by a newline character.
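The counting and normalizing steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the required implementation: the class TransitionMatrixSketch and its method names are hypothetical, it works on an in-memory String rather than a corpus file, and it assumes (based on the example table in the background section) that each seen bigram gets count divided by row total while each unseen bigram gets 0.01.

```java
final class TransitionMatrixSketch {
    static final int LETTERS = 26;
    static final double UNSEEN = 0.01; // assumed floor for bigrams never seen in the corpus

    /** Counts letter-to-letter transitions, keeping only a-z after lower-casing. */
    static int[][] countBigrams(String corpus) {
        int[][] counts = new int[LETTERS][LETTERS];
        int prev = -1; // index of the previous letter, or -1 before the first letter
        for (int i = 0; i < corpus.length(); i++) {
            char ch = Character.toLowerCase(corpus.charAt(i));
            if (ch < 'a' || ch > 'z') {
                continue; // skip anything outside the 26 English letters
            }
            int cur = ch - 'a';
            if (prev >= 0) {
                counts[prev][cur]++; // row = previous letter, col = current letter
            }
            prev = cur;
        }
        return counts;
    }

    /** Converts counts to row-wise relative frequencies, storing UNSEEN for zero counts. */
    static double[][] toProbabilities(int[][] counts) {
        double[][] probs = new double[LETTERS][LETTERS];
        for (int row = 0; row < LETTERS; row++) {
            int rowTotal = 0;
            for (int c : counts[row]) {
                rowTotal += c;
            }
            for (int col = 0; col < LETTERS; col++) {
                probs[row][col] = counts[row][col] == 0
                    ? UNSEEN
                    : (double) counts[row][col] / rowTotal;
            }
        }
        return probs;
    }

    public static void main(String[] args) {
        String corpus = "Barber, baby, bubbles and a bumblebee.";
        double[][] probs = toProbabilities(countBigrams(corpus));
        System.out.printf("p(b, a) = %.2f%n", probs['b' - 'a']['a' - 'a']);
    }
}
```

With this floor, rows that contain unseen bigrams sum to slightly more than 1, as they do in the example table.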
A getName method with no parameters which returns the name of the SourceModel.
A toString method which returns a String representation of the model like the one shown below under Running Your Program in jshell.
A probability method which takes a String and returns a double which indicates the probability that the test string was generated by the source model, using the transition probability matrix created in the constructor. Here’s a recommended algorithm:
Initialize the probability to 1.0.
For each pair of consecutive letters in the test string test, multiply the probability by the entry in the transition probability matrix for the transition from the first letter to the second, which should be found in the row for the first letter and the column for the second letter in the matrix. (You could also think of each pair as the previous letter and the current letter.)
A main method that makes SourceModel runnable from the command line. Your program should take 1 or more corpus file names as command line arguments followed by a quoted string as the last argument. The program should create models for all the corpora and test the string with all the corpora. Here’s an algorithm sketch:
The first n-1 arguments to the program are corpus file names to use to train models. Corpus files are of the form <source-name>.corpus.
The last argument to the program is a quoted string to test.
Create a SourceModel object for each corpus.
Use the models to compute the probability that the test text was produced by each model.
Probabilities will be very small. Normalize the probabilities of all the model predictions to a probability distribution so that they sum to 1 (a closed-world assumption: we only state probabilities relative to the models we have).
Print the results of the analysis.
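The normalization step in the sketch above might look something like this. The class NormalizeSketch, its method names, and the raw probabilities in main are all invented for illustration; they are not output from real corpora.

```java
final class NormalizeSketch {
    /** Returns a copy of raw scaled so that its entries sum to 1. */
    static double[] normalize(double[] raw) {
        double total = 0.0;
        for (double r : raw) {
            total += r;
        }
        double[] scaled = new double[raw.length];
        for (int i = 0; i < raw.length; i++) {
            // If every raw value underflowed to 0.0, this division yields NaN.
            scaled[i] = raw[i] / total;
        }
        return scaled;
    }

    /** Returns the index of the largest value (the first one on ties). */
    static int indexOfMax(double[] values) {
        int best = 0;
        for (int i = 1; i < values.length; i++) {
            if (values[i] > values[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] names = {"english", "french", "hiphop"}; // illustrative model names
        double[] raw = {2e-40, 6e-40, 2e-40};             // made-up raw probabilities
        double[] scaled = normalize(raw);
        for (int i = 0; i < names.length; i++) {
            System.out.printf("Probability that test string is %s: %.2f%n", names[i], scaled[i]);
        }
        System.out.println("Test string is most likely " + names[indexOfMax(scaled)] + ".");
    }
}
```

Dividing by the total keeps the ratios between models intact, so the most likely model is the same before and after normalization.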
Sample runs from the command line:
$ java SourceModel *.corpus "If you got a gun up in your waist please don't shoot up the place (why?)"
Training english model ... done.
Training french model ... done.
Training hiphop model ... done.
Training lisp model ... done.
Training spanish model ... done.
Analyzing: If you got a gun up in your waist please don't shoot up the place (why?)
Probability that test string is english: 0.00
Probability that test string is french: 0.00
Probability that test string is hiphop: 1.00
Probability that test string is lisp: 0.00
Probability that test string is spanish: 0.00
Test string is most likely hiphop.
$ java SourceModel *.corpus "Ou va le monde?"
Training english model ... done.
Training french model ... done.
Training hiphop model ... done.
Training lisp model ... done.
Training spanish model ... done.
Analyzing: Ou va le monde?
Probability that test string is english: 0.02
Probability that test string is french: 0.85
Probability that test string is hiphop: 0.01
Probability that test string is lisp: 0.10
Probability that test string is spanish: 0.01
Test string is most likely french.
$ java SourceModel *.corpus "My other car is a cdr."
Training english model ... done.
Training french model ... done.
Training hiphop model ... done.
Training lisp model ... done.
Training spanish model ... done.
Analyzing: My other car is a cdr.
Probability that test string is english: 0.39
Probability that test string is french: 0.00
Probability that test string is hiphop: 0.61
Probability that test string is lisp: 0.00
Probability that test string is spanish: 0.00
Test string is most likely hiphop.
$ java SourceModel *.corpus "defun Let there be rock"
Training english model ... done.
Training french model ... done.
Training hiphop model ... done.
Training lisp model ... done.
Training spanish model ... done.
Analyzing: defun Let there be rock
Probability that test string is english: 0.01
Probability that test string is french: 0.00
Probability that test string is hiphop: 0.42
Probability that test string is lisp: 0.57
Probability that test string is spanish: 0.00
Test string is most likely lisp.
Sample runs from jshell:
$ jshell
| Welcome to JShell -- Version 10.0.2
| For an introduction type: /help intro
jshell> /open SourceModel.java
jshell> var french = new SourceModel("french", "french.corpus")
Training french model ... done.
french ==> Model: french
a b c d e f ... 1.00 0.01 0.01 0.01 0.01
jshell> System.out.println(french) // implicitly calls french.toString()
Model: french
a b c d e f g h i j k l m n o p q r s t u v w x y z
a 0.01 0.03 0.03 0.02 0.01 0.01 0.03 0.01 0.26 0.01 0.01 0.07 0.07 0.13 0.01 0.06 0.01 0.09 0.06 0.04 0.06 0.05 0.01 0.01 0.01 0.01
b 0.07 0.01 0.01 0.03 0.14 0.01 0.01 0.01 0.07 0.01 0.01 0.21 0.01 0.01 0.14 0.01 0.01 0.24 0.01 0.03 0.07 0.01 0.01 0.01 0.01 0.01
c 0.04 0.02 0.02 0.01 0.26 0.01 0.01 0.19 0.06 0.01 0.01 0.08 0.02 0.01 0.15 0.01 0.01 0.11 0.01 0.01 0.06 0.01 0.01 0.01 0.01 0.01
d 0.14 0.01 0.01 0.01 0.39 0.01 0.01 0.01 0.13 0.01 0.01 0.03 0.01 0.01 0.11 0.01 0.01 0.07 0.03 0.01 0.07 0.01 0.01 0.01 0.01 0.01
e 0.04 0.01 0.04 0.05 0.07 0.01 0.01 0.01 0.01 0.04 0.00 0.07 0.05 0.13 0.01 0.04 0.01 0.07 0.15 0.14 0.06 0.00 0.00 0.01 0.01 0.00
f 0.15 0.01 0.01 0.01 0.23 0.01 0.01 0.01 0.08 0.01 0.01 0.08 0.01 0.01 0.23 0.01 0.01 0.15 0.08 0.01 0.01 0.01 0.01 0.01 0.01 0.01
g 0.01 0.01 0.01 0.01 0.27 0.01 0.01 0.01 0.09 0.01 0.01 0.18 0.05 0.09 0.05 0.01 0.01 0.23 0.01 0.01 0.05 0.01 0.01 0.01 0.01 0.01
h 0.43 0.01 0.01 0.07 0.14 0.01 0.01 0.01 0.07 0.01 0.01 0.07 0.01 0.01 0.21 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
i 0.03 0.02 0.04 0.04 0.16 0.01 0.04 0.01 0.01 0.01 0.01 0.11 0.06 0.09 0.03 0.02 0.01 0.03 0.15 0.14 0.01 0.01 0.01 0.01 0.01 0.01
j 0.24 0.01 0.01 0.01 0.53 0.01 0.01 0.01 0.03 0.01 0.01 0.01 0.01 0.01 0.06 0.01 0.01 0.01 0.01 0.01 0.15 0.01 0.01 0.01 0.01 0.01
k 0.50 0.01 0.01 0.01 0.50 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
l 0.20 0.01 0.01 0.01 0.46 0.01 0.01 0.01 0.07 0.01 0.01 0.11 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.02 0.06 0.01 0.01 0.01 0.01 0.01
m 0.22 0.16 0.01 0.01 0.26 0.01 0.01 0.01 0.10 0.01 0.01 0.01 0.06 0.01 0.12 0.04 0.01 0.01 0.01 0.01 0.03 0.01 0.01 0.01 0.01 0.01
n 0.06 0.01 0.03 0.13 0.16 0.04 0.01 0.01 0.05 0.03 0.01 0.02 0.01 0.04 0.03 0.01 0.04 0.01 0.08 0.22 0.02 0.01 0.01 0.01 0.01 0.01
o 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.09 0.01 0.01 0.03 0.06 0.24 0.01 0.02 0.01 0.18 0.04 0.01 0.28 0.01 0.02 0.01 0.01 0.01
p 0.25 0.01 0.01 0.02 0.11 0.01 0.01 0.02 0.02 0.01 0.01 0.13 0.01 0.01 0.20 0.05 0.01 0.13 0.05 0.01 0.04 0.01 0.01 0.01 0.01 0.01
q 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 1.00 0.01 0.01 0.01 0.01 0.01
r 0.20 0.01 0.03 0.02 0.30 0.01 0.01 0.01 0.08 0.01 0.01 0.06 0.01 0.01 0.05 0.01 0.01 0.03 0.05 0.12 0.02 0.01 0.01 0.01 0.01 0.01
s 0.07 0.02 0.05 0.04 0.15 0.01 0.01 0.01 0.10 0.03 0.01 0.06 0.01 0.01 0.09 0.06 0.03 0.01 0.05 0.09 0.10 0.03 0.01 0.01 0.01 0.01
t 0.13 0.01 0.01 0.04 0.19 0.01 0.01 0.01 0.05 0.04 0.01 0.08 0.03 0.01 0.13 0.01 0.02 0.08 0.01 0.03 0.12 0.01 0.01 0.01 0.01 0.01
u 0.04 0.01 0.02 0.01 0.10 0.01 0.01 0.01 0.07 0.01 0.01 0.05 0.02 0.20 0.01 0.02 0.01 0.24 0.12 0.05 0.02 0.01 0.01 0.01 0.01 0.01
v 0.26 0.01 0.01 0.01 0.37 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.26 0.01 0.01 0.11 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
w 0.01 0.01 0.01 0.67 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.33 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
x 0.01 0.01 0.14 0.01 0.14 0.01 0.01 0.01 0.29 0.01 0.01 0.01 0.01 0.14 0.01 0.14 0.01 0.01 0.01 0.01 0.14 0.01 0.01 0.01 0.01 0.01
y 0.50 0.01 0.25 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.25 0.01 0.01 0.01 0.01 0.01 0.01 0.01
z 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 1.00 0.01 0.01 0.01 0.01
jshell> french.probability("Il y a tout ce que vous voulez aux Champs-Elysees")
$8 ==> 3.966845096265183E-43
toString.
FileReader’s read method returns int. You’ll probably want to cast these to chars. That’s fine. As the documentation says, the lower 16 bits are the Unicode code point for a character.
If you use String.split to get corpus names from file names, remember that . is a special regex character. Use a character class to match a literal . character. For example "foo.fighters".split("[.]") is ["foo", "fighters"].
char is an integral type, so you can easily find a char’s offset from 'a' with an expression like ch - 'a', where ch is a char variable.
The Character class has many static utility methods you will find useful, such as isAlphabetic and toLowerCase.
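The tips above can be combined in a small sketch. The class TipsSketch and its methods are hypothetical, and a StringReader stands in for FileReader so that the example runs without a corpus file; both are Readers whose read method returns an int, with -1 marking the end of the stream. The sketch filters with an explicit a-z range check because Character.isAlphabetic also accepts letters outside the 26 English ones.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

final class TipsSketch {
    /** Returns the part of a corpus file name before the first dot. */
    static String sourceName(String fileName) {
        return fileName.split("[.]")[0]; // "[.]" matches a literal dot
    }

    /** Reads every character and tallies how often each of the letters a-z occurs. */
    static int[] letterCounts(Reader in) throws IOException {
        int[] counts = new int[26];
        int next;
        while ((next = in.read()) != -1) { // -1 signals end of stream
            char ch = Character.toLowerCase((char) next);
            if (ch >= 'a' && ch <= 'z') {
                counts[ch - 'a']++; // 'a' maps to 0, ..., 'z' maps to 25
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(sourceName("french.corpus")); // prints "french"
        try (Reader in = new StringReader("Ou va le monde?")) {
            int[] counts = letterCounts(in);
            System.out.println("o occurs " + counts['o' - 'a'] + " times");
        }
    }
}
```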
For each of your homework assignments we will run checkstyle and deduct one point for every checkstyle error.
For this homework the checkstyle cap is 10. This limit will increase with each homework.
java -jar checkstyle-6.2.2.jar *.java
java -jar checkstyle-6.2.2.jar -j *.java
When completing homeworks for CS1331 you may talk with other students about:
OKAY: “Hey, I’m really confused on how we are supposed to implement this part of the homework. What strategies/resources did you use to solve it?”
BY NO MEANS OKAY: “Hey… the homework is due in like 20 minutes… Can I see your code? I promise I won’t copy it directly!”
In addition to the above rules, note that it is not allowed to upload your code to any sort of public repository. This could be considered an Honor Code violation, even if it is after the homework is due.
Submit your Java source file as an attachment to the hw2 assignment on Canvas. You can submit as many times as you want, so feel free to submit as you make substantial progress on the homework. We only grade your last submission, meaning we will ignore any previous submissions. Please be aware that Canvas will append a number at the end of the file name if multiple submissions are made (e.g. YourFile-1.java). We will take this into consideration as we grade and remove the appended number of the last submission.
As always, late submissions will not be accepted and non-compiling code will be given a score of 0. For this reason, we recommend submitting early and then confirming that you submitted ALL of the necessary files by re-downloading your file(s) and compiling/running them.