What is a reasonable size for an array?
I have a dataset with 100K rows per year, and ideally I want to be able to process multiple years in one go. Every row contains 20 columns of sales data: event names, filenames, dates, amounts, reference numbers, other numbers, customer IDs and demographic data. So lots of strings, text that doesn't always fit the default Excel grid. My current model processes about 3,300 rows per second, which is fine. What bothers me is mostly the structure of the code.
The data comes from multiple files, each with certain metrics used as input for the modifications. Right now I process this dataset in blocks of 30,000 rows, from left to right: each column is loaded into an array, the modifications are applied, and the output is written to a new sheet. Then the next column, and so on.
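To make the current setup concrete, here is a minimal sketch of that per-column pattern. The sheet names, the block size constant and the `Trim$` call are placeholders standing in for my real sources and modification logic:

```vba
Sub ProcessByColumn()
    Dim src As Worksheet, dst As Worksheet
    Dim arr As Variant, r As Long, c As Long
    Const BLOCK As Long = 30000

    Set src = ThisWorkbook.Worksheets("Data")    ' placeholder sheet name
    Set dst = ThisWorkbook.Worksheets("Output")  ' placeholder sheet name

    For c = 1 To 20
        ' Pull one column of the current block into a 2-D Variant array
        arr = src.Cells(1, c).Resize(BLOCK, 1).Value2

        For r = 1 To BLOCK
            ' Placeholder for the real modification logic
            arr(r, 1) = Trim$(CStr(arr(r, 1)))
        Next r

        ' Write the whole column back in a single assignment
        dst.Cells(1, c).Resize(BLOCK, 1).Value2 = arr
    Next c
End Sub
```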
My first concern was that loading 100K rows into an array could cause problems, so I figured splitting the data by source would be a safe way to work with smaller sizes. But testing with 100K rows (column by column) was just as fast, with no distinguishable difference. Now I wonder: could I just load an entire source (30,000 rows × 20 columns) into one array, do my calculations, and then output it all at once? What is a reasonable size? And at what point do you run into trouble?
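For reference, this is the whole-block variant I have in mind, reading all 30,000 × 20 cells into one 2-D Variant array and writing it back in one assignment (again, sheet names and the `Trim$` step are placeholders). Memory-wise it seems modest on paper: a Variant is 16 bytes, so 600,000 elements is roughly 10 MB before any string data, though each string adds its own allocation on top:

```vba
Sub ProcessWholeBlock()
    Dim src As Worksheet, dst As Worksheet
    Dim arr As Variant, r As Long, c As Long

    Set src = ThisWorkbook.Worksheets("Data")    ' placeholder sheet name
    Set dst = ThisWorkbook.Worksheets("Output")  ' placeholder sheet name

    ' One read: 30,000 rows x 20 columns into a 2-D Variant array
    arr = src.Range("A1").Resize(30000, 20).Value2

    For r = 1 To UBound(arr, 1)
        For c = 1 To UBound(arr, 2)
            ' Placeholder for the real per-cell modification
            If VarType(arr(r, c)) = vbString Then arr(r, c) = Trim$(arr(r, c))
        Next c
    Next r

    ' One write: assign the whole array back in a single operation
    dst.Range("A1").Resize(UBound(arr, 1), UBound(arr, 2)).Value2 = arr
End Sub
```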