How can I approach parsing a large amount of structured but inconsistent data?

I'm trying to parse EDGAR documents, which are SEC filings. Specifically, I am trying to analyze both Schedule 13D and Schedule 13G filings.

It looks like there have been many unsuccessful attempts to parse these filings, and I suspect that is because this is a behemoth task that would take a whole team to solve.

I have been tasked with parsing these filings. We need information from the tables found throughout them. The problem is that the inconsistencies between filings make it difficult to distinguish between data points, table section headers, etc.

So far, I have been able to extract information from about 10% of the Schedule 13D filings, and even what I have extracted requires significant cleaning. In a nutshell, I am matching a regex pattern against the text. The pattern takes one known (English) section heading and the one that follows it (I set each pair manually) and captures whatever is in between, for example: CHECK THE APPROPRIATE BOX IF A MEMBER OF A GROUP(.*?)SEC USE ONLY
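The heading-to-heading matching described above can be sketched roughly as follows. This is only a minimal illustration, not the actual code behind the results in this question: the names `HEADINGS`, `extract_between`, and `extract_fields` are hypothetical, the heading list is abbreviated, and the `re.DOTALL` and `re.IGNORECASE` flags are assumptions.

```python
import re

# Hypothetical, abbreviated list of cover-page headings, in the order
# they are expected to appear. A real list would be much longer.
HEADINGS = [
    "CHECK THE APPROPRIATE BOX IF A MEMBER OF A GROUP",
    "SEC USE ONLY",
    "SOURCE OF FUNDS",
]


def extract_between(text, start, end):
    """Return the text between two literal headings, or None if no match."""
    # re.escape makes the headings match literally (parentheses, dots, etc.).
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    match = re.search(pattern, text, flags=re.DOTALL | re.IGNORECASE)
    return match.group(1).strip() if match else None


def extract_fields(text, headings=HEADINGS):
    """Map each heading to the text found before the next heading in the list."""
    fields = {}
    for start, end in zip(headings, headings[1:]):
        fields[start] = extract_between(text, start, end)
    return fields


sample = (
    "CHECK THE APPROPRIATE BOX IF A MEMBER OF A GROUP (see instructions) "
    "(a) (b) SEC USE ONLY SOURCE OF FUNDS (see instructions)"
)

# Print each field with the key truncated to ten characters.
for key, value in extract_fields(sample).items():
    print(f"key: {key[:10]} | v: {value}")
```

Because each pattern depends on both headings appearing verbatim and in the expected order, any filing that words or orders a heading differently yields `None` for that field.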

It is clear that this will not take me far, and it has not. Using the same logic, here is what I get for the following sample text:

Sample text:

NAMES OF REPORTING PERSONS I.R.S. IDENTIFICATION NOS. OF ABOVE PERSONS (ENTITIES ONLY) Robert DePaloCHECK THE APPROPRIATE BOX IF A MEMBER OF A GROUP (see instructions) (a)  (b) SEC USE ONLY SOURCE OF FUNDS (see instructions) CHECK BOX IF DISCLOSURE OF LEGAL PROCEEDINGS IS REQUIRED PURSUANT TO ITEMS 2(d) or 2(e) CITIZENSHIP OR PLACE OF ORGANIZATION United States SOLE VOTING POWER45,119,857 (1) SHARED VOTING POWER-0-SOLE DISPOSITIVE POWER45,119,857 (1) 10.SHARED DISPOSITIVE POWER-0-11. AGGREGATE AMOUNT BENEFICIALLY OWNED BY EACH REPORTING PERSON45,119,857 (1) 12. CHECK BOX IF THE AGGREGATE AMOUNT IN ROW (11) EXCLUDES CERTAIN SHARES (see instructions) 13. PERCENT OF CLASS REPRESENTED BY AMOUNT IN ROW (11) TYPE OF REPORTING PERSON (see instructions) (1) Consists of 44,194,298 ordinary shares owned by the reporting person and 925,559 ordinary shares held by Arjent Limited UK. The reporting person is the Chairman of Arjent Limited UK and has voting and investment power over the shares held by it. Does not include any classes of preferred shares that the reporting person and the entity owned by the reporting person's wife are entitled to receive as discussed in Item 6 below. (2) Does not include the voting interest that the reporting person is entitled to receive from the SPHC Series Preferred Shares as discussed in Item 6 of this Schedule 13D.

Example output:

key: CHECK THE | v: (a)    (b)
key: CITIZENSHI | v: United States
key: CHECK BOX | v:
key: SHARED VOT | v: -0-
key: PERCENT OF | v: PERCENT OF CLASS REPRESENTED BY AMOUNT IN ROW \(11\)
key: TYPE OF RE | v: TYPE OF REPORTING PERSON \(see instructions\)
key: CHECK BOX | v:     13.
key: SOLE DISPO | v: 45,119,857
key: SEC USE ON | v: SEC USE ONLY
key: SHARED DIS | v: -0
key: SOLE VOTIN | v: 45,119,857
key: NAMES OF R | v: Robert DePalo
key: AGGREGATE | v: 45,119,857 12.
key: SOURCE OF | v: SOURCE OF FUNDS \(see instructions\)

Are there any other approaches? This does not work for most 13D filings, and it does not work for 13G at all. I have a feeling that my approach is too naive and that I need a more general way to tackle this kind of problem. I am aiming to extract data from at least 80% of the filings.
