Working fast with large lists

qlwik · 23.04.2019

Hi,
I need to work with large lists and I don't know how to do it fast.

For example, I have a list of 3M URLs and I want to keep 5 URLs from every domain and delete the rest. I tried doing this with "load data from file", a regex and output, and I also tried doing it with lists, but all of this is too slow. I would have to wait a year for the list to finish.

Is it possible to do this fast with ZP?

VladZen · 23.04.2019

Which list operations did you use?
Did you try "Delete lines > matching regex" or something similar?

qlwik · 23.04.2019

I did it like this:
1. A list bound to the large file.
2. Get the first line.
3. Extract the domain from it with a regex.
4. A loop repeated X times that gets and deletes a line containing the text (the domain) and saves it to another list bound to another file.
5. Delete all remaining lines containing the text (the domain).
6. Go back to step 2.

X is a variable loaded from a file only once, at the beginning.
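The steps above rescan the whole list once per domain, which is probably where the time goes on 3M lines. A minimal single-pass sketch in plain C# follows; it is not a ZennoPoster action and not something from this thread. `DomainLimiter`, `GetHost` and `KeepPerDomain` are hypothetical names, and it assumes every line is an absolute URL (with `http://` or `https://`) and that `www.example.com` and `example.com` may be counted as different domains.

```csharp
using System;
using System.Collections.Generic;

public static class DomainLimiter
{
    // Hypothetical helper: returns the lower-cased host of an absolute URL,
    // or null when the line cannot be parsed as one.
    public static string GetHost(string url)
    {
        Uri uri;
        if (Uri.TryCreate(url.Trim(), UriKind.Absolute, out uri)
            && !string.IsNullOrEmpty(uri.Host))
        {
            return uri.Host.ToLowerInvariant();
        }
        return null;
    }

    // Keeps at most `limit` URLs per host, in input order.
    // Each line is visited once; the dictionary holds one counter per host.
    public static List<string> KeepPerDomain(IEnumerable<string> urls, int limit)
    {
        var counts = new Dictionary<string, int>();
        var kept = new List<string>();
        foreach (string url in urls)
        {
            string host = GetHost(url);
            if (host == null) continue;         // skip lines that are not absolute URLs

            int seen;
            counts.TryGetValue(host, out seen); // seen stays 0 for a host not met before
            if (seen < limit)
            {
                kept.Add(url);
                counts[host] = seen + 1;
            }
        }
        return kept;
    }
}
```

Outside ZP this could be fed with `File.ReadLines(inputPath)` and the result written with `File.WriteAllLines(outputPath, kept)`, where both paths are placeholders. Inside a ZP C# action the lines would come from the bound list instead.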

VladZen · 24.04.2019

qlwik wrote:
I did it like this:
1. A list bound to the large file.
2. Get the first line.
3. Extract the domain from it with a regex.
4. A loop repeated X times that gets and deletes a line containing the text (the domain) and saves it to another list bound to another file.
5. Delete all remaining lines containing the text (the domain).
6. Go back to step 2.

X is a variable loaded from a file only once, at the beginning.

qlwik · 24.04.2019

"temp" is a list bound to the big file.

VladZen · 24.04.2019

Hey. I gave you an easier way to get the lines containing a domain from the list. Did you try it?

qlwik · 24.04.2019

But it will take only one line, so I need to make a loop anyway. Or maybe I'm misunderstanding something?

VladZen · 24.04.2019

qlwik wrote:
But it will take only one line, so I need to make a loop anyway. Or maybe I'm misunderstanding something?

OK, do it like this:

EtaLasquera · 24.04.2019

I did something like this a while ago, but it worked with millions of e-mail addresses.

Code:

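// E-mail addresses are read from column A of Table1; kept rows go to List1.
var tbl = project.Tables["Table1"];
var lst = project.Lists["List1"];
int s = tbl.RowCount;
string data = "";
try{
   while (s != 0){
     // Take the domain from the first remaining row.
     string str = tbl.GetCell("A",0);
     List<int> found = new List<int>();
     int qtd = 0;
     int ini = str.IndexOf("@",0) + 1;      // index of the first character after '@'
     int end = str.IndexOf(".",ini) - ini;  // length up to the first dot after '@'
     string domain = str.Substring(ini,end);

     // Pass 1: copy up to 5 rows containing the domain to the output list.
     int j = 0;
     while (qtd < 5){
         if (j >= tbl.RowCount) break;      // stop at the end of the table
         data = tbl.GetCell("A",j);
         if (data.Contains(domain)){        // substring match, not an exact domain match
           lst.Add(data);
           qtd++;
         }
         j++;
       }

     // Pass 2: collect the indexes of all rows containing the domain.
     j = 0;
     while (j < tbl.RowCount){
       data = tbl.GetCell("A",j);
       if (data.Contains(domain)){
         found.Add(j);
       }
       j++;
     }

     // Remove the collected rows, then continue with whatever is left.
     tbl.DeleteRow(found);
     found.Clear();
     s = tbl.RowCount;
   }
}
catch (Exception ex){
   // Reached, for example, when a row has no '@' or no '.' after it and Substring throws.
   project.SendErrorToLog("Stopped: " + ex.Message);
}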

For domains, the key is how the "domain" substring is located in this part of the code:

Code:

int ini = str.IndexOf("@",0) + 1; // position right after the @ in the e-mail address
int end = str.IndexOf(".",ini) - ini; // length up to the first dot after the @
string domain = str.Substring(ini,end); // myemail@yahoo.com gives "yahoo"

For 1.4 million e-mail addresses, that code needs about 2 s to process the list.
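As a side note, the substring arithmetic above throws when a row has no "@" or no "." after it. Here is a small guarded sketch of the same extraction in plain C#. `EmailDomain.Extract` is a hypothetical helper name, not part of ZP, and returning null for malformed rows is an assumption about what is wanted.

```csharp
using System;

public static class EmailDomain
{
    // Hypothetical helper: returns the text between '@' and the next '.',
    // or null when the input does not have that shape.
    public static string Extract(string address)
    {
        if (string.IsNullOrEmpty(address)) return null;

        int at = address.IndexOf('@');
        if (at < 0) return null;               // no '@' at all

        int start = at + 1;                    // first character after '@'
        int dot = address.IndexOf('.', start); // first dot after '@'
        if (dot <= start) return null;         // no dot after '@', or nothing between them

        return address.Substring(start, dot - start);
    }
}
```

For URLs the markers are different (there is normally no "@"), so the two `IndexOf` calls would have to look for something else, or the host would have to come from a URL parser.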

qlwik · 25.04.2019

OK guys, thanks for the help. I will try both solutions.
