Regex Extraction Help Please

djljzenno

Client
Registered: 26.12.2013
Messages: 43
Thanks: 2
Points: 8
Hello!

I am having trouble extracting data from pages like the one below, because the tags change throughout the page.

Can someone help me with the regex?

http://en.wikipedia.org/wiki/List_of_Billboard_Hot_100_number-one_singles_of_1990

What I need is only the "Song" and "Artist" text, so that I can save it to a CSV file.

Example data output desired:

"Another Day in Paradise","Phil Collins"
"How Am I Supposed to Live Without You","Michael Bolton"

Thanks for your help!
 

LexxWork

Client
Registered: 31.10.2013
Messages: 1 190
Thanks: 789
Points: 113
Use the mobile version of the page, with a link like this:
http://en.m.wikipedia.org/w/index.php?mobileaction=toggle_view_mobile&title=List_of_Billboard_Hot_100_number-one_singles_of_1998
This code saves all the needed data to [project directory] + "\" + title + ".csv":
C#:
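// Build one CSV line per matching row of the table selected by the XPath.
string table = string.Join("\r\n",
    instance.ActiveTab.FindElementsByXPath("//table[@class='wikitable'][2]/tbody/tr").Elements
    .Where(e => e.GetChildren(false).Count == 4)   // keep only rows with exactly four cells
    .Skip(1)                                       // skip the first matching row
    .Select<HtmlElement, string>(e => {
        var t = e.GetChildren(false).Elements;
        // t[1] is the song cell (its text already includes quotation marks); t[2] is the artist cell.
        return (t[1].GetAttribute("innertext").Replace("\r\n", "").Replace(" \"", "\"").Replace("\" ", "\"")
                    + "," +
           "\"" + t[2].GetAttribute("innertext").Replace("\r\n", "") + "\""
        );
    }).ToArray());

// Nothing matched: return an error instead of writing an empty file.
if (table == "") return "error";
// Use the value of the "title" query parameter as the file name.
string title = System.Text.RegularExpressions.Regex.Match(instance.ActiveTab.URL, "(?<=title=)[^&]+").Value;
System.IO.File.WriteAllText(project.Directory + "\\" + title + ".csv", table, Encoding.UTF8);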
Example of the data returned:
"Something About the Way You Look Tonight"/"Candle in the Wind 1997","Elton John"
"Truly Madly Deeply","Savage Garden"
"Together Again","Janet"
"Nice & Slow","Usher"
"My Heart Will Go On","Céline Dion"
"Gettin' Jiggy wit It ”,"Will Smith"
"All My Life","K-Ci & JoJo"
"Too Close","Next"
"My All","Mariah Carey"
"Too Close","Next"
"The Boy Is Mine","Brandy and Monica"
"I Don't Want to Miss a Thing","Aerosmith"
"The First Night","Monica"
"One Week","Barenaked Ladies"
"The First Night","Monica"
"Doo Wop (That Thing)","Lauryn Hill"
"Lately","Divine"
"I'm Your Angel","R. Kelly and Céline Dion"
 
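The sample output above shows two rough edges of building CSV lines by plain concatenation: a double A-side comes out as two quoted strings joined by a slash, which is not a single well-formed quoted field, and a typographic quote can slip through. Below is a minimal sketch of one way to handle this, assuming RFC 4180-style quoting (wrap each field in double quotes and double any embedded quote). The names `CsvSketch`, `CleanTitle`, `QuoteField` and `ToCsvLine` are hypothetical helpers made up for illustration; they are not part of the ZennoPoster API or of the code posted above.

```csharp
using System;
using System.Linq;

public static class CsvSketch
{
    // Hypothetical helper: remove line breaks and all straight or curly
    // double quotes from a cell's text, then trim surrounding whitespace.
    public static string CleanTitle(string raw)
    {
        return raw
            .Replace("\r", "")
            .Replace("\n", "")
            .Replace("\"", "")
            .Replace("\u201C", "")   // left curly quote
            .Replace("\u201D", "")   // right curly quote
            .Trim();
    }

    // Quote one field: wrap it in double quotes and double any
    // double quote that is still embedded in the value.
    public static string QuoteField(string value)
    {
        return "\"" + value.Replace("\"", "\"\"") + "\"";
    }

    // Join already-cleaned fields into a single CSV line.
    public static string ToCsvLine(params string[] fields)
    {
        return string.Join(",", fields.Select(f => QuoteField(f)).ToArray());
    }
}
```

For example, `CsvSketch.ToCsvLine(CsvSketch.CleanTitle("\"Too Close\"\r\n"), "Next")` gives `"Too Close","Next"`. Inside the `Select` lambda this could replace the hand-built concatenation, at the cost of losing the separate quotes around each half of a double A-side title.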

djljzenno

That works beautifully LexxWork!!

Thank you!


 
