HtmlAgilityPack to parse content

Perfecto

Client
Joined
06.08.2013
Messages
94
Thanks
5
Points
8
Hi,

I am trying to extract content with HtmlAgilityPack:
C#:
using HtmlAgilityPack;

string htmlContent = project.Variables["DOM"].Value;

var doc = new HtmlDocument();
doc.LoadHtml(htmlContent);

var article = doc.DocumentNode.SelectSingleNode("//body");

// Remove all unwanted elements within the article
foreach (var node in article.SelectNodes("//*[not(starts-with(name(),'h')) and not(name()='h') and not(name()='ul') and not(name()='li') and not(name()='strong') and not(name()='b')]"))
{
    node.Remove();
}

// Print the article content with only the desired tags
string extractedContent = project.Variables["content"].Value;
I get this error:

[Attachment 105165]


I have installed the latest version of HtmlAgilityPack (net45):
[Attachment 105166]
 

Phoenix78

Client
Read only
Joined
06.11.2018
Messages
11 790
Thanks
5 720
Points
113

Perfecto

Phoenix78 wrote:
Remove "using HtmlAgilityPack;" from the snippet and put it in the general code, on the Using tab.

[Attachment 105167]
Thanks, it works.

There is a problem with my code:

C#:
string htmlContent = project.Variables["DOM"].Value;

var doc = new HtmlDocument();
doc.LoadHtml(htmlContent);

if (doc.DocumentNode != null)
{
    // Find the element containing the article
    // In this example, we assume the article is contained within the <body> tag
    var article = doc.DocumentNode.SelectSingleNode("//body");

    if (article != null)
    {
        // Remove all unwanted elements within the article
        var nodesToRemove = article.SelectNodes("//*[not(starts-with(name(),'h')) and not(name()='h') and not(name()='ul') and not(name()='li') and not(name()='strong') and not(name()='b') and not(name()='p')]");

        if (nodesToRemove != null)
        {
            foreach (var node in nodesToRemove)
            {
                node.Remove();
            }
        }

        // Remove HTML comments
        var comments = article.SelectNodes("//comment()");
        if (comments != null)
        {
            foreach (var comment in comments)
            {
                comment.Remove();
            }
        }

        // Replace <p> tags with their inner text
        var pTags = article.SelectNodes("//p");
        if (pTags != null)
        {
            foreach (var pTag in pTags)
            {
                if (pTag.ParentNode != null)
                {
                    pTag.ParentNode.InsertBefore(HtmlTextNode.CreateNode(pTag.InnerText), pTag);
                    pTag.Remove();
                }
            }
        }

        // Store the article content with only the desired tags into the ZennoPoster variable
        project.Variables["clean_content"].Value = article.InnerHtml;
    }
    else
    {
        // if <body> not found
        project.Variables["clean_content"].Value = "No body";
    }
}
else
{
    // if DocumentNode is null
    project.Variables["clean_content"].Value = "DocumentNode is null";
}
My goal is to take HTML pages from different sites and clean up the markup while:
keeping the Hn (heading), strong, b, ul and li tags together with their content,
keeping the content of the <p> tags but without the tags themselves,
and removing the HTML comments.

The result is not what I expect, and I can't understand why...
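
A possible explanation, offered as a guess from reading the code rather than from running it in ZennoPoster: an XPath that starts with "//" searches the whole document even when it is called on the article node, and the expression above also matches body itself and wrappers such as div (their names do not start with "h" and are not in the list). Remove() deletes a node together with all of its descendants, so every heading or paragraph that sits inside a div is lost. In addition, starts-with(name(),'h') keeps html, head, header and hr as well as the headings, and replacing a p with its InnerText drops any strong or b inside it. The sketch below unwraps unwanted elements instead of deleting them. ArticleCleaner, Keep and Drop are hypothetical names, and dropping script, style and noscript with their content is an assumption about what counts as unwanted.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class ArticleCleaner
{
    // Tags whose markup is preserved (h1..h6 listed explicitly instead of starts-with 'h').
    private static readonly HashSet<string> Keep = new HashSet<string>
        { "h1", "h2", "h3", "h4", "h5", "h6", "ul", "li", "strong", "b" };

    // Assumption: these tags are removed together with their content.
    private static readonly HashSet<string> Drop = new HashSet<string>
        { "script", "style", "noscript" };

    public static string Clean(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Fall back to the whole document when the page has no <body>.
        var body = doc.DocumentNode.SelectSingleNode("//body") ?? doc.DocumentNode;

        // Take a snapshot first: the tree is modified inside the loop.
        foreach (var node in body.Descendants().ToList())
        {
            if (node.NodeType == HtmlNodeType.Comment || Drop.Contains(node.Name))
            {
                // Comments and dropped tags go away with everything inside them.
                node.Remove();
            }
            else if (node.NodeType == HtmlNodeType.Element && !Keep.Contains(node.Name))
            {
                // Unwrap: delete the tag (div, p, span, ...) but keep its children.
                node.ParentNode.RemoveChild(node, true);
            }
        }

        return body.InnerHtml;
    }
}
```

Inside a ZennoPoster snippet this could be called as project.Variables["clean_content"].Value = ArticleCleaner.Clean(project.Variables["DOM"].Value); assuming the class has been placed in the project's general code.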
 

lokiys

Moderator
Joined
01.02.2012
Messages
4 812
Thanks
1 187
Points
113
Use
C#:
project.SendInfoToLog("Your comment or data", false);
in your code, check what values are returned, and fix your code accordingly.
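
As an illustration of that advice, here is a minimal sketch that reports how many nodes an XPath matches and which tag names they have. XPathDiagnostics and LogMatches are hypothetical names, and the log delegate is a stand-in for project.SendInfoToLog so that the helper can also be tried outside ZennoPoster.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class XPathDiagnostics
{
    // `log` stands in for: message => project.SendInfoToLog(message, false)
    public static int LogMatches(string html, string xpath, Action<string> log)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var body = doc.DocumentNode.SelectSingleNode("//body");
        if (body == null)
        {
            log("No <body> found");
            return 0;
        }

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = body.SelectNodes(xpath);
        int count = nodes == null ? 0 : nodes.Count;
        log("XPath matched " + count + " node(s)");

        if (nodes != null)
        {
            // Distinct tag names show at a glance what would be removed.
            log("Names: " + string.Join(", ", nodes.Select(n => n.Name).Distinct()));
        }

        return count;
    }
}
```

In a snippet the call might look like XPathDiagnostics.LogMatches(project.Variables["DOM"].Value, xpath, m => project.SendInfoToLog(m, false)); with xpath holding the removal expression. If names such as body or div show up in the log, the expression is selecting containers whose removal also deletes the content you want to keep.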
 


Pierre Paul Jacques

Active member
Joined
08.10.2023
Messages
134
Thanks
35
Points
28
Hi, I have the same trouble with this pack.

Maybe I missed something?

Thanks in advance

[Attachment 117893]
[Attachment 117894]
[Attachment 117895]
 
