Text clustering in a log file
I am working on the problem of finding similar content in a log file. Let's say I have a log file that looks like this:
show version
Operating System (OS) Software
Software
BIOS: version 1.0.10
loader: version N/A
kickstart: version 4.2(7b)
system: version 4.2(7b)
BIOS compile time: 01/08/09
kickstart image file is: bootflash:/m9500-sf2ek9-kickstart-mz.4.2.7b.bin
kickstart compile time: 8/16/2010 13:00:00 [09/29/2010 23:10:48]
system image file is: bootflash:/m9500-sf2ek9-mz.4.2.7b.bin
system compile time: 8/16/2010 13:00:00 [09/30/2010 00:46:36]
Hardware
xxxx MDS 9509 (9 Slot) Chassis ("xxxxxxx/xxxxx-2")
xxxxxxx, xxxx with 1033100 kB of memory.
Processor Board ID xxxx
Device name: xxx-xxx-1
bootflash: 1000440 kB
slot0: 0 kB (expansion flash)
It is easy for the human eye to see that "Software" and the lines below it form one section, and "Hardware" and the lines below it form another. Is there a way to simulate this with machine learning or some other method, so that similar sections are grouped based on a template? Also, I have shown two sections with similar templates here, but the templates can differ between sections, and in that case they should be identified as different sections. I tried cosine similarity, but it doesn't help much, because it is not the words that are similar but the pattern.
I actually see two separate machine learning problems here:
1) If I understood correctly, the first problem you want to solve is splitting each log into separate sections: one for Hardware, one for Software, and so on.
To achieve this, you could try to detect the headings that mark the beginning of a new section. To do that, you could manually annotate a set of different logs, marking each line as heading = true or heading = false.
Then you could train a classifier that takes your labelled lines as input; the result would be a model that predicts headings in new logs.
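A minimal sketch of that line-classifier idea, assuming scikit-learn is available; the training lines and labels below are invented for illustration. Character n-grams are used so the model picks up the shape of a line (short, no digits, no colons) rather than specific words:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hand-labelled lines: True = section heading, False = ordinary log line
lines = [
    "Software", "Hardware", "Interfaces", "Power Supplies",
    "BIOS: version 1.0.10",
    "kickstart: version 4.2(7b)",
    "bootflash: 1000440 kB",
    "Processor Board ID xxxx",
]
is_heading = [True, True, True, True, False, False, False, False]

# Character n-grams capture line shape (digits, colons, length), not vocabulary
model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)
model.fit(lines, is_heading)

print(model.predict(["Fans", "system: version 4.2(7b)"]))
```

With only a handful of labelled lines this is just a toy; in practice you would label a few hundred lines from different logs.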
2) Now that you can find these headings, you can split each log at them and treat each section as a separate document.
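The splitting step is plain bookkeeping once each line carries a heading flag; a small sketch (the helper name and sample data are mine):

```python
def split_sections(lines, is_heading):
    """Group lines into sections, starting a new section at each heading."""
    sections, current = [], []
    for line, heading in zip(lines, is_heading):
        if heading and current:
            sections.append(current)  # close the previous section
            current = []
        current.append(line)
    if current:
        sections.append(current)
    return sections

log = ["Software", "BIOS: version 1.0.10", "Hardware", "bootflash: 1000440 kB"]
flags = [True, False, True, False]
print(split_sections(log, flags))
# → [['Software', 'BIOS: version 1.0.10'], ['Hardware', 'bootflash: 1000440 kB']]
```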
To cluster the resulting per-section documents, I would first try a standard NLP pipeline:
- Tokenize each document.
- Normalize the tokens (maybe not a good idea for logs).
- Create a tf-idf vector for each document.
- Start with a simple clustering algorithm like k-means to try to group similar sections.
After clustering, sections that are similar to each other should end up in the same cluster.
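The pipeline above can be sketched in a few lines with scikit-learn; the sample sections here are invented stand-ins for extracted Software/Hardware blocks, and k=2 assumes you already know how many templates there are:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Each string is one extracted section, treated as its own document
sections = [
    "BIOS: version 1.0.10 kickstart: version 4.2(7b) system: version 4.2(7b)",
    "BIOS: version 2.1.17 kickstart: version 5.0(1a) system: version 5.0(1a)",
    "MDS 9509 (9 Slot) Chassis Processor Board ID bootflash: 1000440 kB",
    "MDS 9513 (13 Slot) Chassis Processor Board ID bootflash: 2000880 kB",
]

# Tokenize and tf-idf weight each section (l2 normalization is the default)
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(sections)

# k-means with k=2: ideally one cluster per section template
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)
print(labels)
```

If the number of templates is unknown, you would have to sweep k or switch to a clustering method that does not require it, such as DBSCAN.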
I hope this helps. I think especially the first task is prone to fail, and maybe hand-crafted patterns will work better there.