Looping through many lines

I am having a performance problem with a loop over up to 1 million rows pulled from the database. I am currently loading the rows into a DataTable and scrolling through them, but it keeps getting slower. What is the alternative? I could break the rows into chunks of 20,000. Can parallel processing be used in C#? Basically, the code goes through each potential record that matches a particular query and tries to figure out whether it is a legitimate record, which is why each record needs to be visited individually. The records for one object can run up to 10 million rows. The options seem to be parallel processing across multiple machines, parallel processing on a single machine with multiple cores, or a change of data structure / approach.

Any opinions, thoughts, and guesses on how to do this quickly and intelligently are welcome.

2 answers


First: do not use a DataTable for operations like these:

  • it is slow
  • it consumes too much memory
  • and you have to wait a long time before you can start processing the data
    • during that time the additional cores do nothing, because reading the data into the DataTable is not parallelizable
    • also, reading the data tends to use very little CPU, since the main cause of delay is network or other I/O latency

So again: do not use a DataTable for such operations.
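
For comparison, the discouraged pattern looks roughly like this (a sketch only; the table name and connection string are placeholders, as in the samples below):

using System.Data;
using System.Data.SqlClient;

var table = new DataTable();
using(var connection = new SqlConnection("InsertConnectionString"))
using(var adapter = new SqlDataAdapter("SELECT * FROM Table", connection))
{
  // Fill blocks until every row has been transferred into memory
  adapter.Fill(table);
}

foreach(DataRow row in table.Rows)
{
  // processing can only begin after the full load has finished
}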

Use a DataReader instead. It allows you to start consuming / processing the data immediately, rather than waiting for it all to be loaded. The simplest version looks like this (sample for MS SQL Server):



var command = new SqlCommand
{
  CommandText = "SELECT * FROM Table",
  Connection = new SqlConnection("InsertConnectionString")
};

// the connection must be opened before ExecuteReader is called
command.Connection.Open();

using(var reader = command.ExecuteReader())
{
  while(reader.Read())
  {
    // copy the current row's column values into an array
    var values = new object[reader.FieldCount];
    reader.GetValues(values);

    // process values of row
  }
}

      

The reader blocks while your processing code is running, which means no further rows are read from the database in the meantime. If the processing code is heavy, it may be worth using the Task library to create tasks that do the validation, which lets you use multiple cores. However, creating a Task has some overhead; if a single Task doesn't contain enough "work", you can batch a couple of rows together:

public void ReadData()
{
  var taskList = new List<Task<SomeResultType>>();

  var command = new SqlCommand
  {
    CommandText = "SELECT * FROM Table",
    Connection = new SqlConnection("InsertConnectionString")
  };
  // the connection must be opened before ExecuteReader is called
  command.Connection.Open();

  using(var reader = command.ExecuteReader())
  {
    var valueList = new List<object[]>(100);
    while(reader.Read())
    {
      var values = new object[reader.FieldCount];
      reader.GetValues(values);

      valueList.Add(values);

      // once a full batch has accumulated, hand it off to a task
      if(valueList.Count == 100)
      {
        var localValueList = valueList.ToList();
        valueList.Clear();

        taskList.Add(Task<SomeResultType>.Factory.StartNew(() => Process(localValueList)));
      }
    }
    // don't forget the final, partially filled batch
    if(valueList.Count > 0)
      taskList.Add(Task<SomeResultType>.Factory.StartNew(() => Process(valueList)));
  }

  // this line completes when all tasks are done
  Task.WaitAll(taskList.ToArray());
}

public SomeResultType Process(List<object[]> valueList)
{
  foreach(var vals in valueList)
  {
    // put your processing code here; be sure to synchronize access to shared resources properly
  }
  return default(SomeResultType); // placeholder: return the batch's actual result
}

      

  • The batch size (currently 100) depends on the actual processing and may need to be tuned.
  • Synchronization has its own pitfalls; you need to be very careful with shared resources (see the sketch below).
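
For example, if the tasks accumulate results into shared state, that state must be updated in a thread-safe way. A minimal sketch of one way to do this, assuming each batch simply counts its valid records (the Aggregator class and its members are made up for illustration):

using System.Threading;

// Hypothetical aggregator shared by all tasks: the running total is
// updated atomically with Interlocked, so no explicit lock is needed.
public class Aggregator
{
  private long validCount;

  public void AddBatchResult(int validInBatch)
  {
    // atomic add across threads
    Interlocked.Add(ref validCount, validInBatch);
  }

  public long Total()
  {
    // atomic read of the 64-bit counter
    return Interlocked.Read(ref validCount);
  }
}

For anything richer than a counter (for example, collecting the matching rows themselves), the thread-safe collections in System.Collections.Concurrent, such as ConcurrentBag<T>, are usually easier to get right than manual locking.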

I would suggest a parallel loop on a dual-core (or better) machine, and would also try using shared lists for each loop; I think this might make your process faster.
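
A minimal sketch of such a parallel loop, assuming the rows to validate are already in memory (the class, method name, and row type are placeholders):

using System.Collections.Generic;
using System.Threading.Tasks;

public static class Validator
{
  public static void ValidateInParallel(IList<object[]> rows)
  {
    // Parallel.ForEach spreads the iterations across the available cores
    Parallel.ForEach(rows, row =>
    {
      // validate one record here; avoid writing to shared,
      // unsynchronized state from inside the loop body
    });
  }
}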


