Fast working on large lists

  • Thread starter: qlwik

qlwik

Hi,
I need to work on large lists and I don't know how to do it fast.

For example, I have a list with 3M URLs, and I want to keep 5 URLs from every domain and delete the rest. I tried doing this with load data from file, regex and output, and I also tried doing it with lists, but all of this is too slow; I would have to wait a year for this list to finish.

Is it possible to do it with ZP fast?
 
Which list operations did you use?
Did you try Delete lines > matching regex or something similar?
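
For readers working in a C# snippet instead of the built-in action, deleting every line for one domain in a single call might look like the sketch below. The class and method names are hypothetical, the pattern assumes lines start with http:// or https://, and it operates on a plain in-memory List<string>, not on a ZennoPoster list object.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LineFilter
{
    // Hypothetical helper: removes every line whose URL host is the given domain
    // (with or without a leading "www.") and returns how many lines were removed.
    public static int RemoveDomain(List<string> lines, string domain)
    {
        // Regex.Escape keeps the dots in the domain literal.
        // The trailing group stops "example.com" from matching "example.com.evil.org".
        var pattern = new Regex(
            @"^https?://(www\.)?" + Regex.Escape(domain) + @"(/|:|$)",
            RegexOptions.IgnoreCase);

        // List<T>.RemoveAll removes all matches in one pass over the list.
        return lines.RemoveAll(line => pattern.IsMatch(line));
    }
}
```

This still costs one pass over the list per domain, so on its own it does not remove the need for the outer loop.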
 
I did it like this:
1. A list bound to the large file.
2. Get the first line.
3. Extract the domain from it with a regex.
4. A loop repeated X times that gets and deletes a line containing the text (domain) and saves it to another list bound to another file.
5. Delete all lines containing the text (domain).
6. Go back to step 2.

X is a variable loaded from a file only once, at the beginning.
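
The steps above rescan the whole list once per domain, which is likely why they are slow on 3M lines. One possible alternative, sketched here in plain C#, reads the list once and counts how many URLs have already been kept per host. The names DomainLimiter, GetHost and KeepPerDomain are hypothetical, and "domain" is assumed to mean the exact host, so www.example.com and example.com count separately.

```csharp
using System;
using System.Collections.Generic;

public static class DomainLimiter
{
    // Hypothetical helper: returns the host of an absolute URL,
    // or null when the line cannot be parsed as one.
    public static string GetHost(string url)
    {
        Uri uri;
        return Uri.TryCreate(url, UriKind.Absolute, out uri) ? uri.Host : null;
    }

    // Keeps at most maxPerDomain URLs per host, preserving input order.
    public static List<string> KeepPerDomain(IEnumerable<string> urls, int maxPerDomain)
    {
        // host -> number of URLs already kept for that host
        var counts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        var kept = new List<string>();

        foreach (string url in urls)
        {
            string host = GetHost(url);
            if (host == null) continue;          // skip lines that are not absolute URLs

            int seen;
            counts.TryGetValue(host, out seen);  // seen stays 0 for a new host
            if (seen >= maxPerDomain) continue;  // this host already has enough URLs

            counts[host] = seen + 1;
            kept.Add(url);
        }
        return kept;
    }
}
```

Assuming the URLs sit in a text file, the input could be streamed with File.ReadLines and the result written back with File.WriteAllLines, so each line is examined once.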
 
[Attached image: Get Line.png]
 
Hey. I gave you an easier way to get the lines with a domain from the list. Did you try that?
 
But it will take only one line, so I need a loop anyway. Maybe I'm misunderstanding something?
 
I did something like this a while ago, but it worked with millions of e-mails.
Code:
var tbl = project.Tables["Table1"];
var lst = project.Lists["List1"];
try{
   // Each pass handles the domain of the first remaining row,
   // then deletes every row that belongs to that domain.
   while (tbl.RowCount != 0){
     string str = tbl.GetCell("A",0);
     int ini = str.IndexOf("@",0) + 1;       // position right after the "@"
     int end = str.IndexOf(".",ini) - ini;   // length up to the first dot after the "@"
     string domain = str.Substring(ini,end); // throws if there is no dot after the "@"
     List<int> found = new List<int>();      // indexes of all rows matching this domain
     int qtd = 0;                            // rows already copied for this domain
     for (int j = 0; j < tbl.RowCount; j++){
       string data = tbl.GetCell("A",j);
       if (!data.Contains(domain)) continue; // substring match anywhere in the row
       if (qtd < 5){
         lst.Add(data);                      // keep only the first 5 matches
         qtd++;
       }
       found.Add(j);
     }
     tbl.DeleteRow(found);
   }
}
catch{
   // Any exception (for example a malformed row) stops processing.
   project.SendErrorToLog("end");
}

For URLs, the key is how you cut out the "domain" substring in this part of the code:
Code:
int ini = str.IndexOf("@",0) + 1; // position right after the "@" in the e-mail
int end = str.IndexOf(".",ini) - ini; // length up to the first dot after the "@"
string domain = str.Substring(ini,end); // for myemail@yahoo.com this gives "yahoo"

For 1.4 million e-mails, that code needs 2 s to process the list.
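
As a sketch of how the same ini/end idea could be adapted from e-mails to URLs: take the text between "://" and the next "/". The class and method names are hypothetical, and ports or user:password@ prefixes are not handled here.

```csharp
using System;

public static class UrlDomain
{
    // Returns the text between "://" and the next "/".
    // A line without "://" is read from its first character.
    public static string Extract(string url)
    {
        int schemeEnd = url.IndexOf("://", StringComparison.Ordinal);
        int ini = schemeEnd >= 0 ? schemeEnd + 3 : 0; // first character of the host
        int end = url.IndexOf('/', ini);              // first slash after the host
        if (end < 0) end = url.Length;                // no path: host runs to the end
        return url.Substring(ini, end - ini);
    }
}
```

Unlike the e-mail version, this keeps the full host (for example "example.com" and not just "example"), so different sites that share a first label are not mixed together.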
 

Attachments:

  • demo.xmlz (14.6 KB)
OK guys, thanks for the help. I will try both solutions.
 
