Text clustering in a log file

I am working on the problem of finding similar content in a log file. Let's say I have a log file that looks like this:

 show version
 Operating System (OS) Software

 Software
 BIOS:      version 1.0.10
 loader:    version N/A
 kickstart: version 4.2(7b)
 system:    version 4.2(7b)
 BIOS compile time:       01/08/09
 kickstart image file is: bootflash:/m9500-sf2ek9-kickstart-mz.4.2.7b.bin
 kickstart compile time:  8/16/2010 13:00:00 [09/29/2010 23:10:48]
 system image file is:    bootflash:/m9500-sf2ek9-mz.4.2.7b.bin
 system compile time:     8/16/2010 13:00:00 [09/30/2010 00:46:36]

 Hardware
 xxxx MDS 9509 (9 Slot) Chassis ("xxxxxxx/xxxxx-2")
 xxxxxxx, xxxx with 1033100 kB of memory.
 Processor Board ID xxxx

 Device name: xxx-xxx-1 
 bootflash:    1000440 kB 
 slot0:              0 kB (expansion flash)

      

It is easy for the human eye to see that "Software" and the data below it form one section, and that "Hardware" and the data below it form another. Is there a way to simulate this using machine learning or some other method, grouping similar sections based on a template? Also, the two sections I have shown follow similar templates, but the templates can differ between sections, and such sections should be identified as different. I tried cosine similarity, but it doesn't help much, because it is not the words that are similar but the pattern.


1 answer


I actually see two separate machine learning problems here:

1) If I understood correctly, the first problem you want to solve is splitting each log into separate sections: one for Hardware, one for Software, and so on.

To achieve this, you could try to detect the headings that mark the beginning of a new section. To do that, you could manually take a set of different logs and label each line as heading = true or heading = false.

Now you could train a classifier that takes your labeled data as input and produces a model.
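As a minimal sketch of that step, assuming a few hand-labeled lines and some illustrative line-level features (the feature names and the tiny training set below are my own assumptions, not part of the question):

```python
# Sketch: train a heading classifier on manually labeled log lines.
# Features and training data are illustrative assumptions.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def line_features(line):
    """Simple per-line features that distinguish headings from data lines."""
    stripped = line.strip()
    return {
        "n_tokens": len(stripped.split()),
        "has_digit": any(c.isdigit() for c in stripped),
        "has_colon": ":" in stripped,
        "starts_upper": stripped[:1].isupper(),
        "indent": len(line) - len(line.lstrip()),
    }

# Hypothetical hand-labeled data: (line, is_heading)
labeled = [
    (" Software", True),
    (" Hardware", True),
    (" BIOS:      version 1.0.10", False),
    (" system:    version 4.2(7b)", False),
    (" Device name: xxx-xxx-1", False),
    (" bootflash:    1000440 kB", False),
]

vec = DictVectorizer()
X = vec.fit_transform([line_features(line) for line, _ in labeled])
y = [label for _, label in labeled]

clf = LogisticRegression().fit(X, y)
```

In practice you would label many logs, not six lines, and probably add pattern-oriented features (e.g. whether the line matches a "word: value" shape), since the question notes that the structure matters more than the words.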

2) Once you can detect these headings, you can split each log into its sections and treat each section as a separate document.
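The splitting itself is then straightforward. A minimal sketch, using a simple hand-written heading rule as a stand-in for the trained classifier (the rule is an assumption for illustration):

```python
# Sketch: split a log into sections at detected headings.
# `is_heading` stands in for the classifier from step 1.
def is_heading(line):
    s = line.strip()
    # Illustrative rule: short line, no colon, no digits.
    return bool(s) and len(s.split()) <= 2 and ":" not in s and not any(c.isdigit() for c in s)

def split_sections(lines):
    """Group lines into sections, each starting at a heading line."""
    sections, current = [], []
    for line in lines:
        if is_heading(line) and current:
            sections.append(current)
            current = []
        current.append(line)
    if current:
        sections.append(current)
    return sections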



Now, I would first try clustering the section documents using the standard NLP pipeline:

  • Tokenize each document.
  • Normalize the tokens (maybe not a good idea for logs).
  • Build a tf-idf vector for each document.
  • Start with a simple clustering algorithm like k-means to group similar sections.

After clustering, sections that are similar to each other should end up in the same cluster.
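The pipeline above can be sketched in a few lines with scikit-learn. The section texts and the choice of two clusters are illustrative assumptions:

```python
# Sketch: tf-idf vectors per section document, then k-means clustering.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical section documents from two logs: two "Software"-style
# sections and two "Hardware"-style sections.
sections = [
    "BIOS version 1.0.10 kickstart version 4.2 system version 4.2",
    "MDS 9509 Chassis Processor Board ID memory kB",
    "BIOS version 2.1.3 kickstart version 5.0 system version 5.0",
    "MDS 9513 Chassis Processor Board ID memory kB",
]

# Whitespace tokenization keeps tokens like "1.0.10" intact.
X = TfidfVectorizer(token_pattern=r"\S+").fit_transform(sections)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```

Here the two Software-style sections and the two Hardware-style sections land in separate clusters. For the pattern-versus-words problem mentioned in the question, it may help to replace raw tokens with token shapes (e.g. map every number to a placeholder) before vectorizing.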

I hope this helps. I think the first task in particular is prone to failure, and manual rules might work better there.







