Splitting a string into all possible 4-letter sequential phrases
What I am trying to do is basically this:
- Read the file;
- Remove all punctuation marks and convert all letters to lowercase;
- Convert words to 4 letter phrases (if the word is shorter than 4 characters, take it as a whole);
Example:
Input: Hello, my identification is Mister Dude.
Result: hell, ello, my, iden, dent, enti, ntif, tifi, ific, fica, icat, cati, atio, tion, is, mist, iste, ster, dude.
It would be nice if I could get each 4-letter phrase as a separate value in an array.
Here is all I have managed so far:
public String[] OpenFile() throws IOException {
    FileReader fr = new FileReader(path);
    BufferedReader textReader = new BufferedReader(fr);
    int numberOfLines = readLines();
    String[] textData = new String[numberOfLines];
    int i;
    for (i = 0; i < numberOfLines; i++) {
        textData[i] = textReader.readLine();
        textData[i] = textData[i].replaceAll("[^A-Za-ząčęėįšųūž]+", " ").toLowerCase();
    }
    textReader.close();
    return textData;
}
textData[i] is each line of text that I need to split. I have tried several approaches, such as .toCharArray and 2D arrays, but I cannot work out the letter-splitting part. How can I complete step 3?
Tested on ideone.com:
public static void main (String[] args) {
    String text = "Hello, my identification is Mister Dude.";
    // Drop everything except word characters and spaces, then split on spaces.
    String[] words = text.replaceAll("[^\\w ]+", "").toLowerCase().split(" ");
    for (String word : words) {
        if (word.length() <= 4) {
            // Short words are kept whole.
            System.out.println(word);
        }
        else {
            // Slide a 4-character window across the word.
            for (int i = 0; i <= word.length() - 4; i++) {
                System.out.println(word.substring(i, i + 4));
            }
        }
    }
}
Basically, for each word, you need to iterate over every position at which a four-letter sequence can start:
import java.util.LinkedList;
import java.util.List;

public static List<String> sequences (String line) {
    List<String> result = new LinkedList<>();
    String[] words = line.split(" ");
    for (String word : words) {
        if (word.length() <= 4) {
            // Short words are kept whole.
            result.add(word);
        } else {
            // The last valid start index is word.length() - 4.
            for (int i = 0; i <= word.length() - 4; ++i) {
                result.add(word.substring(i, i + 4));
            }
        }
    }
    return result;
}
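Since the question asks for an array, the list returned by a method like the one above can be converted with toArray. Here is a minimal sketch of that, assuming the line has already been stripped of punctuation and lowercased; the class name SequencesDemo is made up for illustration, and the method is a slightly more defensive copy that also tolerates repeated spaces:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SequencesDemo {
    // Same sliding-window idea: every start index from 0 to length - 4.
    static List<String> sequences(String line) {
        List<String> result = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // happens only when the whole line is blank
            }
            if (word.length() <= 4) {
                result.add(word);
            } else {
                for (int i = 0; i <= word.length() - 4; i++) {
                    result.add(word.substring(i, i + 4));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // toArray(new String[0]) copies the list into a String[] of the right size.
        String[] phrases = sequences("hello my mister dude").toArray(new String[0]);
        System.out.println(Arrays.toString(phrases));
        // prints [hell, ello, my, mist, iste, ster, dude]
    }
}
```

To process the whole file, the same call could be made once per element of textData, adding each result to one shared list before converting it.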
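An example that works on your textData array:
// Needs java.util.ArrayList and java.util.List.
List<String> result = new ArrayList<String>();
for (int i = 0; i < textData.length; i++) {
    String[] currLine = textData[i].split("\\s+");
    for (String word : currLine) {
        if (word.isEmpty()) {
            continue; // a line that starts with a space yields an empty first token
        }
        if (word.length() > 4) {
            // Bound the loop by the word's length, not by the number of words on the line.
            for (int j = 0; j <= word.length() - 4; j++) {
                result.add(word.substring(j, j + 4));
            }
        } else {
            result.add(word);
        }
    }
}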
I have not tested it, so please check and let me know if it works.
First you need to split your text on spaces and punctuation marks. Notice the split("\\W+") call, which breaks on any run of spaces and punctuation marks.
In my example, I had
String text = "Hello, my identification is Mister Dude.";
String[] textArray = text.split("\\W+");
List<String> result = new ArrayList<>();
for (String word : textArray) {
    result.addAll(Arrays.asList(split(word.toLowerCase(), 4)));
}
and then the method
private static String[] split(String word, int letters) {
    if (word == null || word.length() == 0) {
        return new String[] {};
    } else if (word.length() <= letters) {
        return new String[] { word };
    } else {
        int quantity = (word.length() - letters) + 1;
        String[] val = new String[quantity];
        int a = 0;
        while (a + letters <= word.length()) {
            val[a] = word.substring(a, a + letters);
            a++;
        }
        return val;
    }
}
Printing result gives the following:
[hell, ello, my, iden, dent, enti, ntif, tifi, ific, fica, icat, cati, atio, tion, is, mist, iste, ster, dude]
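One caveat for the Lithuanian text in the question: by default, Java's \W treats only ASCII letters, digits, and the underscore as word characters, so split("\\W+") would also break words at letters such as ą or š. One possible way around this is to split on \P{L}+ (runs of characters that are not Unicode letters). The sketch below assumes that approach; the class and method names are illustrative, and non-ASCII letters are written as Unicode escapes so the file compiles regardless of source encoding:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class UnicodeChunks {
    // Splits text on runs of non-letters, then slides a window of `letters` characters.
    static List<String> chunks(String text, int letters) {
        List<String> result = new ArrayList<>();
        // Locale.ROOT keeps lowercasing independent of the machine's default locale.
        for (String word : text.toLowerCase(Locale.ROOT).split("\\P{L}+")) {
            if (word.isEmpty()) {
                continue; // a leading separator yields an empty first token
            }
            if (word.length() <= letters) {
                result.add(word);
            } else {
                for (int i = 0; i + letters <= word.length(); i++) {
                    result.add(word.substring(i, i + letters));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // U+017E is the Lithuanian letter z with caron.
        String word = "gra\u017Eus";
        // Default \W is ASCII-only, so the word is cut in two at U+017E.
        System.out.println(Arrays.toString(word.split("\\W+")));
        // \P{L}+ keeps the word intact, so the 4-letter windows span the accented letter.
        System.out.println(chunks(word, 4));
        System.out.println(chunks("Hello, my identification is Mister Dude.", 4));
    }
}
```

Note that \P{L}+ also treats digits and underscores as separators, which matches the question's own regex but differs from \W+.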