Tika: detecting multipart/signed

I am using Tika to automatically detect the content type of documents placed in the DMS. Almost everything works fine except for emails.

I need to differentiate between standard email messages (MIME type message/rfc822) and signed email messages (MIME type multipart/signed), but all email messages show up as message/rfc822.

Signed mail that is not being detected correctly has the following content type header:

Content-Type: multipart/signed; protocol="application/x-pkcs7-signature"; micalg=sha1; boundary="----4898E6D8BDE1929CA602BE94D115EF4C"

      

The Java code I'm using for detection is:

Detector detector;
List<Detector> detectors = new ArrayList<Detector>();
detectors.add(new ZipContainerDetector());
detectors.add(new POIFSContainerDetector());
detectors.add(MimeTypes.getDefaultMimeTypes());
detector = new CompositeDetector(detectors);
String mimetype = detector.detect(TikaInputStream.get(new File(args[0])), new Metadata()).toString();

      

I reference tika-core and tika-parsers so that PDF and MS Word documents are detected as well. Did I miss something?



1 answer


I solved my problem by writing a custom detector that implements the Detector interface:

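import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

import javax.mail.Session;
import javax.mail.internet.MimeMessage;

import org.apache.tika.detect.Detector;
import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TemporaryResources;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;

public class MultipartSignedDetector implements Detector {

  @Override
  public MediaType detect(InputStream is, Metadata metadata) throws IOException {

    // Nothing to inspect without a stream
    if (is == null) {
      return MediaType.OCTET_STREAM;
    }

    TemporaryResources tmp = new TemporaryResources();

    TikaInputStream tis = TikaInputStream.get(is, tmp);
    // Remember the current position so the stream can be handed back untouched
    tis.mark(Integer.MAX_VALUE);

    try {

      // A bare session is enough for parsing; no SMTP host or system properties are needed
      Session session = Session.getInstance(new Properties());

      MimeMessage mimeMessage = new MimeMessage(session, tis);
      String contentType = mimeMessage.getContentType();

      // Require a Message-ID so that arbitrary files are not mistaken for mail
      if (contentType != null && mimeMessage.getMessageID() != null && contentType.toLowerCase().contains("multipart/signed"))
        return new MediaType("multipart", "signed");
      else
        return MediaType.OCTET_STREAM;

    } catch (Exception e) {
      // Not parseable as a MIME message: let the other detectors decide
      return MediaType.OCTET_STREAM;
    } finally {
      try {
        tis.reset();
        tmp.dispose();
      } catch (TikaException e) {
        // ignore
      }
    }
  }
}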
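
The detector above parses the whole message with JavaMail. If only the top-level Content-Type matters, a lighter check might be enough. The following is a rough sketch, not part of Tika or JavaMail: the class name SignedMailSniffer, the method isMultipartSigned and the 64 KB header limit are all made up for illustration. It reads only the start of the stream, looks for the Content-Type header in the header block and resets the stream afterwards.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class SignedMailSniffer {

    // Hypothetical limit: assume the top-level headers fit into the first 64 KB.
    private static final int MAX_HEADER_BYTES = 64 * 1024;
    private static final String HEADER_NAME = "Content-Type:";

    /** Returns true if the top-level Content-Type header is multipart/signed. */
    public static boolean isMultipartSigned(InputStream in) throws IOException {
        if (in == null || !in.markSupported()) {
            return false;
        }
        byte[] buf = new byte[MAX_HEADER_BYTES];
        int len = 0;
        in.mark(MAX_HEADER_BYTES);
        try {
            int n;
            // Read at most MAX_HEADER_BYTES so the mark stays valid
            while (len < buf.length && (n = in.read(buf, len, buf.length - len)) > 0) {
                len += n;
            }
        } finally {
            in.reset(); // hand the stream back at its original position
        }

        String head = new String(buf, 0, len, StandardCharsets.ISO_8859_1);
        StringBuilder contentType = null;
        for (String line : head.split("\r?\n")) {
            if (line.isEmpty()) {
                break; // a blank line ends the header block
            }
            if (contentType != null) {
                if (line.startsWith(" ") || line.startsWith("\t")) {
                    // folded continuation of the Content-Type header
                    contentType.append(' ').append(line.trim());
                    continue;
                }
                break; // the next header starts here
            }
            if (line.regionMatches(true, 0, HEADER_NAME, 0, HEADER_NAME.length())) {
                contentType = new StringBuilder(line.substring(HEADER_NAME.length()).trim());
            }
        }
        if (contentType == null) {
            return false;
        }
        // Keep only the media type, dropping parameters such as protocol or boundary
        String value = contentType.toString();
        int semi = value.indexOf(';');
        String type = (semi >= 0 ? value.substring(0, semi) : value).trim().toLowerCase(Locale.ROOT);
        return type.equals("multipart/signed");
    }
}
```

Inside a Detector, the result of such a check could be mapped to new MediaType("multipart", "signed") or MediaType.OCTET_STREAM in the same way as above. Unlike the JavaMail version, this sketch does not require a Message-ID and does no real MIME validation, so it may match files that merely start with mail-like headers.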

      



Then add the custom detector to the composite detector just before the default:

Detector detector;
List<Detector> detectors = new ArrayList<Detector>();
detectors.add(new ZipContainerDetector());
detectors.add(new POIFSContainerDetector());

detectors.add(new MultipartSignedDetector());

detectors.add(MimeTypes.getDefaultMimeTypes());
detector = new CompositeDetector(detectors);
String mimetype = detector.detect(TikaInputStream.get(new File(args[0])), new Metadata()).toString();

      
