Split a list of objects into two lists by a size ratio and by a per-class ratio

I have a list of objects, call it dataset, that I want to split into two new lists, call them trainSet and testSet, while reading dataset line by line. The list object inherits from the Node.js read stream object (in this particular case).

I then define a split ratio, split_ratio = 0.75, so that trainSet receives 75% of dataset and testSet the remaining 25% of its length (for example):

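var split_ratio = 0.75,
    dataSize = dataset.length,
    lim_75 = 0, // target size of trainSet
    lim_25 = 0; // target size of testSet

lim_75 = Math.floor(dataSize * split_ratio); // e.g.: .75
if (lim_75 < 1) lim_75 = 1;
// take the remainder so that the two limits always add up to dataSize
// (flooring both products can lose an element, e.g. 7 + 2 for 10 rows)
lim_25 = dataSize - lim_75;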

      

While reading line by line, I do:

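var lim_75_count = 0, // rows written to trainSet so far
    lim_25_count = 0, // rows written to testSet so far
    row;

while ((row = dataset.next())) {
    var r = Math.random(); // random draw in [0, 1) that picks the target set
    // write to trainSet if it was drawn and still has room,
    // or if testSet was drawn but is already full (>=, not >)
    if ((r < split_ratio && lim_75_count < lim_75) || (r >= split_ratio && lim_25_count >= lim_25)) {
        lim_75_count += 1;
        trainSet.write(row);
    } else {
        lim_25_count += 1;
        testSet.write(row);
    }
}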

      

This works: at the end I get two lists, trainSet with 0.75 and testSet with 0.25 of the length of dataset.
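For reference, the size-based split can be sketched on a plain in-memory array instead of a stream. This is only an illustration under assumptions: splitBySize and its rng parameter are hypothetical names that do not appear in the code above, and the condition uses >= for the testSet limit so that testSet never grows past its target.

```javascript
// Hypothetical helper: size-based split of an in-memory array,
// modeled on the streaming condition in the question.
// rng is injectable (defaults to Math.random) so a run can be made repeatable.
function splitBySize(rows, splitRatio, rng) {
    rng = rng || Math.random;
    var trainLimit = Math.max(1, Math.floor(rows.length * splitRatio)),
        testLimit = rows.length - trainLimit, // remainder, so the limits add up
        trainSet = [],
        testSet = [];

    rows.forEach(function (row) {
        var r = rng();
        // trainSet if drawn and not full, or if testSet was drawn but is full
        if ((r < splitRatio && trainSet.length < trainLimit) ||
            (r >= splitRatio && testSet.length >= testLimit)) {
            trainSet.push(row);
        } else {
            testSet.push(row);
        }
    });

    return { trainSet: trainSet, testSet: testSet };
}
```

Which rows land in which list is random, but the two lengths are fixed by the limits: for 100 rows and a ratio of 0.75 the result always has 75 and 25 elements.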

Now, suppose the objects in the list dataset have a structure like

{
 objectId: 12345,
 objectClass: 'CLASS_A',
 objectValue: 'The quick brown fox jumps over the lazy dog'
}

      

where objectClass belongs to an enumeration of values such as CLASS_A, CLASS_B, etc.

I want to keep the above split_ratio for the relative lengths of trainSet and testSet, but add a new condition that also divides dataset by a second ratio, keyed on the object property objectClass, according to this partition map:

[
 {
  objectClass: 'CLASS_A',
  ratio: 0.30
 },
 {
  objectClass: 'CLASS_B',
  ratio: 0.20
 },
 {
  objectClass: 'CLASS_C',
  ratio: 0.50
 }
]
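The ratios in this map add up to 1. Assuming that is meant as a requirement (the question does not say so explicitly), a small check could look like the sketch below; ratiosSumToOne and its tolerance parameter are hypothetical names.

```javascript
// Hypothetical helper: checks that the ratios of a partition map add up to 1.
// A small tolerance absorbs floating-point rounding in the sum.
function ratiosSumToOne(partitionMap, tolerance) {
    tolerance = tolerance === undefined ? 1e-9 : tolerance;
    var sum = partitionMap.reduce(function (acc, entry) {
        return acc + entry.ratio;
    }, 0);
    return Math.abs(sum - 1) <= tolerance;
}
```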

      

What is the correct splitting condition that works together with (AND) the first, size-based one?
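One possible reading, sketched here under assumptions rather than as a definitive answer: keep a target count per class and per set, and apply the same "drawn and has room, or the other one is full" test to those per-class counters. The names pickSet, quotas and counts are hypothetical, and the quotas are assumed to be known before the rows are read.

```javascript
// Hypothetical helper: decides which set one row goes to.
// quotas: e.g. { CLASS_A: { train: 14, test: 6 }, ... } (assumed precomputed)
// counts: same shape, starts at zero and is updated in place
// r:      a random draw in [0, 1), e.g. Math.random()
// Returns 'train', 'test', or null when the row has to be skipped.
function pickSet(row, splitRatio, quotas, counts, r) {
    var q = quotas[row.objectClass],
        c = counts[row.objectClass];
    if (!q || !c) return null; // class is not in the partition map

    var trainHasRoom = c.train < q.train,
        testHasRoom = c.test < q.test;
    if (!trainHasRoom && !testHasRoom) return null; // both quotas of this class are full

    // same shape as the size-based condition, but checked per class
    if ((r < splitRatio && trainHasRoom) || !testHasRoom) {
        c.train += 1;
        return 'train';
    }
    c.test += 1;
    return 'test';
}
```

If the per-class quotas of each set add up to that set's size limit, the size-based condition is satisfied as a side effect. The trade-off is that a row is skipped once both quotas of its class are full, which happens when dataset contains more rows of a class than the map allows.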

Example

A concrete example to clarify: suppose split_ratio = 0.70. Then we must have trainSet.length = 70 and testSet.length = 30 if dataset.length = 100.

Now, given two objectClass types with ratio values CLASS_A = 0.20 and CLASS_B = 0.80, we must have 0.20 * 70 = 14 CLASS_A elements in trainSet and 0.20 * 30 = 6 in testSet, while we will have 0.80 * 70 = 56 CLASS_B elements in trainSet and 0.80 * 30 = 24 CLASS_B elements in testSet.

In this case, it is assumed that the partition map contains only the types CLASS_A and CLASS_B, such as:

[
 {
  objectClass: 'CLASS_A',
  ratio: 0.20
 },
 {
  objectClass: 'CLASS_B',
  ratio: 0.80
 }
]
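The arithmetic of this example can be written down as a small function that turns the size ratio and the partition map into target counts per class and per set. This is a sketch under assumptions: classQuotas is a hypothetical name, each class ratio is assumed to apply to both sets, and the last class simply takes whatever remains so that the counts add up to the set sizes.

```javascript
// Hypothetical helper: per-class target counts for trainSet and testSet.
function classQuotas(dataSize, splitRatio, partitionMap) {
    var trainSize = Math.floor(dataSize * splitRatio),
        testSize = dataSize - trainSize,
        trainLeft = trainSize,
        testLeft = testSize,
        quotas = {};

    partitionMap.forEach(function (entry, i) {
        var last = i === partitionMap.length - 1,
            // the last class takes the remainder so the quotas add up
            train = last ? trainLeft : Math.round(trainSize * entry.ratio),
            test = last ? testLeft : Math.round(testSize * entry.ratio);
        quotas[entry.objectClass] = { train: train, test: test };
        trainLeft -= train;
        testLeft -= test;
    });

    return quotas;
}

var quotas = classQuotas(100, 0.70, [
    { objectClass: 'CLASS_A', ratio: 0.20 },
    { objectClass: 'CLASS_B', ratio: 0.80 }
]);
// quotas.CLASS_A -> { train: 14, test: 6 }
// quotas.CLASS_B -> { train: 56, test: 24 }
```

Rounding each class separately can drift by an element for awkward sizes, which is why the remainder goes to the last class here; another policy may suit a real dataset better.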

      
