How to parse data from a table

djljzenno

Client
Registered
26.12.2013
Posts
43
Thanks
2
Points
8
Hi

I would like to parse the artist and song data from this table. I'm having trouble because the many links on the page are messing it up. I have tried all kinds of regex, even working with just the page.text, and I still can't figure it out.

Here is the page in question:
http://en.wikipedia.org/wiki/List_of_Billboard_Hot_100_number-one_singles_of_2013

I would like to grab the data like this...

Song;Artist

Then get the next one in the table, and so on for every row. I will then go to the next page and do the same.

Thank you!

LJ
 

bigcajones

Client
Registered
09.02.2011
Posts
1 216
Thanks
683
Points
113
Regex the table off the page first, with something like:

(?<=<table\ class="wikitable\ plainrowheaders">)[\w\W]*?(?=</table>)

Then you can get the song titles and artists out of that using ordinary regex.
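
As a rough illustration of this first step outside ZennoPoster, here is a minimal Python sketch. It assumes the page HTML is already available as a string and that the table opens with exactly the class attribute shown above; the sample string and the function name extract_table are made up for the example.

```python
import re

# Same idea as the pattern above, with the "\ " escapes written as plain spaces.
# The lookbehind is fixed-width, which Python's re module requires.
TABLE_RE = re.compile(
    r'(?<=<table class="wikitable plainrowheaders">)[\w\W]*?(?=</table>)'
)

def extract_table(page_html):
    """Return the inner HTML of the first matching table, or None if absent."""
    match = TABLE_RE.search(page_html)
    return match.group(0) if match else None

# Hypothetical, heavily shortened page used only to show the behaviour.
sample = (
    '<p><a href="/wiki/Other">unrelated link</a></p>'
    '<table class="wikitable plainrowheaders"><tr><td>row</td></tr></table>'
    '<p>footer</p>'
)
table_html = extract_table(sample)
print(table_html)
```

Everything outside the table, including the unrelated links, is discarded before any song or artist pattern runs.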
 

djljzenno

OK, that helped. I'm just trying to figure this out now.

Here is the regex I have so far to grab the data:

(?<=rowspan="4"\ style="text-align:\ center;">"*?<a\ title=").*(?="\ href=")

Example data
<td bgcolor="#FFFF99" rowspan="4" style="text-align: center;">"<a title="Thrift Shop" href="/wiki/Thrift_Shop">Thrift Shop</a>" <img height="14" width="9" src="//upload.wikimedia.org/wikipedia/commons/3/37/Dagger-14-plain.png" alt="dagger" /></td>
<td rowspan="4" style="text-align: center;"><a title="Macklemore" href="/wiki/Macklemore">Macklemore</a> &amp; <a title="Ryan Lewis" href="/wiki/Ryan_Lewis">Ryan Lewis</a> featuring <a title="Wanz" href="/wiki/Wanz">Wanz</a></td>
<td style="text-align: center;"><sup class="reference" id="cite_ref-19"><a href="#cite_note-19"><span>[</span>17<span>]</span></a></sup></td>

Collects...
Thrift Shop
Macklemore" href="/wiki/Macklemore">Macklemore</a> &amp; <a title="Ryan Lewis" href="/wiki/Ryan_Lewis">Ryan Lewis</a> featuring <a title="Wanz

I can't figure out how to cut off everything after Macklemore. I only want the first artist.
Also, how do I put this into a table format instead of a list?

Like...

Thrift Shop;Macklemore

Thank you!!

LJ
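
One possible way to get only the first artist, sketched in Python rather than in the ZennoPoster project: the greedy .* in the pattern above runs to the last " href=" on the line, whereas [^"]* stops at the first closing quote. The two cells are the example data from this post (the song cell is trimmed after the link); the names first_title and to_row are invented for the sketch.

```python
import re

# [^>]* keeps the search inside one <a ...> tag; [^"]* stops at the first
# closing quote, so only the first title attribute in the cell is captured.
TITLE_RE = re.compile(r'<a\b[^>]*\btitle="([^"]*)"')

def first_title(cell_html):
    """Return the title attribute of the first link in a cell, or ''."""
    match = TITLE_RE.search(cell_html)
    return match.group(1) if match else ''

def to_row(song_cell, artist_cell):
    """Join song and first artist into one 'Song;Artist' line."""
    return f"{first_title(song_cell)};{first_title(artist_cell)}"

song_cell = (
    '<td bgcolor="#FFFF99" rowspan="4" style="text-align: center;">'
    '"<a title="Thrift Shop" href="/wiki/Thrift_Shop">Thrift Shop</a>"</td>'
)
artist_cell = (
    '<td rowspan="4" style="text-align: center;">'
    '<a title="Macklemore" href="/wiki/Macklemore">Macklemore</a> &amp; '
    '<a title="Ryan Lewis" href="/wiki/Ryan_Lewis">Ryan Lewis</a> featuring '
    '<a title="Wanz" href="/wiki/Wanz">Wanz</a></td>'
)

row = to_row(song_cell, artist_cell)
print(row)  # Thrift Shop;Macklemore
```

Writing one such line per table row gives the Song;Artist layout instead of a flat list.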
 

bigcajones

Code:
https://www.dropbox.com/s/jfmswta9bgl0wqt/wikiBillboard.xmlz
 

djljzenno


bigcajones

That's what I'm here for. :bk:
 

djljzenno

OK bigcajones, I scraped about 25 pages and hit an issue I can't figure out.

I updated your regex (if you don't mind looking at the project) to accommodate variants in the tables.

I changed
(?<=rowspan="\d+.*">).+

to
(?<=\ align="center">).*</a>.?</td>

because rowspan is not always present but align is.

This refined it a little and caught the ones it was missing.

It is still not picking up the artist shown below when the name is not inside an <a> tag and is just plain text. I can't figure out how to get it. This throws off the file: every artist/song pair after it is inaccurate because the list is one step off.

<td align="center">May 17</td>
<td rowspan="3" align="center">"<a href="/wiki/The_Greatest_Love_of_All" title="The Greatest Love of All">Greatest Love of All</a>"</td>
<td rowspan="3" align="center">Whitney Houston</td>
<td align="center"><sup id="cite_ref-19" class="reference"><a href="#cite_note-19"><span>[</span>19<span>]</span></a></sup><sup id="cite_ref-20" class="reference"><a href="#cite_note-20"><span>[</span>20<span>]</span></a></sup></td>

Found on this page: http://en.wikipedia.org/wiki/List_of_Billboard_Hot_100_number-one_singles_of_1986

Thanks for all your help.

LJ
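
A hedged Python sketch of one way to handle both cases: split the row into cells first, then take the first link's text if the cell has a link and the cell's plain text otherwise. The row is the 1986 example from this post (without the footnote cell); the helper name first_name and the assumption that the song and artist are the second and third cells are illustrative only.

```python
import html
import re

CELL_RE = re.compile(r'<td\b[^>]*>(.*?)</td>', re.DOTALL)  # inner HTML of each cell
LINK_RE = re.compile(r'<a\b[^>]*>(.*?)</a>', re.DOTALL)    # text of a link
TAG_RE = re.compile(r'<[^>]+>')                            # any leftover tag

def first_name(cell_inner):
    """Text of the first link in the cell, or the cell's plain text if none."""
    link = LINK_RE.search(cell_inner)
    raw = link.group(1) if link else cell_inner
    text = html.unescape(TAG_RE.sub('', raw))
    return text.strip().strip('"').strip()

row_1986 = (
    '<td align="center">May 17</td>\n'
    '<td rowspan="3" align="center">"<a href="/wiki/The_Greatest_Love_of_All" '
    'title="The Greatest Love of All">Greatest Love of All</a>"</td>\n'
    '<td rowspan="3" align="center">Whitney Houston</td>'
)

cells = CELL_RE.findall(row_1986)
# Assumption: in this row layout, cell 1 is the song and cell 2 is the artist.
line = f"{first_name(cells[1])};{first_name(cells[2])}"
print(line)  # Greatest Love of All;Whitney Houston
```

Because every row yields exactly one song and one artist whether or not the artist is linked, the pairs should no longer drift out of step. Rows that inherit a cell through rowspan would still need separate handling.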
 


bigcajones

I tested the regex I originally gave you on the years 1959-2014. It works great, except that in 2009 the Black Eyed Peas had two songs in the same weeks, which is where it went haywire. Also, for 2010 it looks like it picked up some elements you should get rid of, such as [25], which is easy enough to check for.
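
The check for markers such as [25] could look roughly like this in Python. The function names and the sample values are hypothetical, and the sketch assumes the scraped items are already plain strings.

```python
import re

# A footnote marker is a number in square brackets, optionally preceded by spaces.
REF_RE = re.compile(r'\s*\[\d+\]')

def drop_refs(text):
    """Remove footnote markers like [25] from one scraped item."""
    return REF_RE.sub('', text).strip()

def clean_list(items):
    """Strip markers and discard items that were nothing but markers."""
    cleaned = (drop_refs(item) for item in items)
    return [item for item in cleaned if item]

# Hypothetical scraped values, for illustration only.
scraped = ["Song A", "[25]", "Artist A[19][20]"]
result = clean_list(scraped)
print(result)  # ['Song A', 'Artist A']
```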
 

djljzenno

Yes, you were right!

I put your regex back in and added a check to take the numbers out. 2009 is the only one goofing up.

Thanks!
 
