HtmlAgilityPack to parse content

  • Thread starter: Perfecto

Perfecto

Hi,

I'm trying to extract content with HtmlAgilityPack:
C#:
using HtmlAgilityPack;

string htmlContent = project.Variables["DOM"].Value;

var doc = new HtmlDocument();
doc.LoadHtml(htmlContent);

var article = doc.DocumentNode.SelectSingleNode("//body");

// Remove all unwanted elements within the article
foreach (var node in article.SelectNodes("//*[not(starts-with(name(),'h')) and not(name()='h') and not(name()='ul') and not(name()='li') and not(name()='strong') and not(name()='b')]"))
{
    node.Remove();
}

// Print the article content with only the desired tags
string extractedContent = project.Variables["content"].Value;

I get this error:

[Attachment 105165]

I have installed the latest version of HtmlAgilityPack (net45):

[Attachment 105166]
 
Remove the line using HtmlAgilityPack; from the snippet.
Put it in the general code instead, on the Using tab.

[Attachment 105167]
Thanks, it works.

There is still a problem with my code:

C#:
string htmlContent = project.Variables["DOM"].Value;

var doc = new HtmlDocument();
doc.LoadHtml(htmlContent);

if (doc.DocumentNode != null)
{
    // Find the element containing the article
    // In this example, we assume the article is contained within the <body> tag
    var article = doc.DocumentNode.SelectSingleNode("//body");

    if (article != null)
    {
        // Remove all unwanted elements within the article
        var nodesToRemove = article.SelectNodes("//*[not(starts-with(name(),'h')) and not(name()='h') and not(name()='ul') and not(name()='li') and not(name()='strong') and not(name()='b') and not(name()='p')]");

        if (nodesToRemove != null)
        {
            foreach (var node in nodesToRemove)
            {
                node.Remove();
            }
        }

        // Remove HTML comments
        var comments = article.SelectNodes("//comment()");
        if (comments != null)
        {
            foreach (var comment in comments)
            {
                comment.Remove();
            }
        }

        // Replace <p> tags with their inner text
        var pTags = article.SelectNodes("//p");
        if (pTags != null)
        {
            foreach (var pTag in pTags)
            {
                if (pTag.ParentNode != null)
                {
                    pTag.ParentNode.InsertBefore(HtmlTextNode.CreateNode(pTag.InnerText), pTag);
                    pTag.Remove();
                }
            }
        }

        // Store the article content with only the desired tags into the ZennoPoster variable
        project.Variables["clean_content"].Value = article.InnerHtml;
    }
    else
    {
        // if <body> not found
        project.Variables["clean_content"].Value = "No body";
    }
}
else
{
    // if DocumentNode is null
    project.Variables["clean_content"].Value = "DocumentNode is null";
}

My goal is to take HTML pages from different sites and strip out the markup, while:
keeping the Hn, strong, b, ul and li tags and their content,
keeping the content of the <p> tags but without the tags themselves,
removing the HTML comments.

The result is not what I expect, and I can't understand why...
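
For reference, here is a minimal sketch of one way to reach this goal. It is not a confirmed fix from this thread. Two things in the snippet above are worth checking. First, an XPath that starts with // searches the whole document even when it is called on article (a relative search would start with .//). Second, Remove() on a parent such as a div takes its h2, ul and p descendants away with it. The sketch avoids both by walking the tree once and writing out only the wanted tags. The class name HtmlCleaner, the method name Clean and the decision to drop script, style and noscript elements entirely are assumptions made for illustration. Attributes on kept tags are discarded as well.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack or ZennoPoster.
public static class HtmlCleaner
{
    // Tags that are written to the output together with their content.
    static readonly HashSet<string> Keep = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "h1", "h2", "h3", "h4", "h5", "h6", "strong", "b", "ul", "li" };

    // Assumption: these elements are skipped along with everything inside them.
    static readonly HashSet<string> Drop = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "script", "style", "noscript" };

    public static string Clean(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Fall back to the whole document when there is no <body>.
        HtmlNode root = doc.DocumentNode.SelectSingleNode("//body") ?? doc.DocumentNode;

        var sb = new StringBuilder();
        Append(root, sb);
        return sb.ToString();
    }

    // Depth-first walk: the tree is only read, never modified.
    static void Append(HtmlNode node, StringBuilder sb)
    {
        foreach (HtmlNode child in node.ChildNodes)
        {
            if (child.NodeType == HtmlNodeType.Comment)
                continue; // HTML comments are dropped

            if (child.NodeType == HtmlNodeType.Text)
            {
                sb.Append(((HtmlTextNode)child).Text); // raw text, whitespace included
                continue;
            }

            if (Drop.Contains(child.Name))
                continue;

            bool keep = Keep.Contains(child.Name);
            if (keep) sb.Append('<').Append(child.Name).Append('>');
            Append(child, sb); // <p>, <div> and other tags contribute only their content
            if (keep) sb.Append("</").Append(child.Name).Append('>');
        }
    }
}
```

Assuming the class is placed in the project's general code, the snippet itself could then shrink to a single line such as project.Variables["clean_content"].Value = HtmlCleaner.Clean(project.Variables["DOM"].Value);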
 
Thank you for your quick response.
It returns the same thing as in the "clean_content" variable.
In my example I used this page: https://www.lavieclaire.com/conseils/quels-sont-les-bienfaits-de-la-spiruline/
And the result is:


C#:
<!-- Google Tag Manager (noscript) -->

<!-- End Google Tag Manager (noscript) -->

    
    

        
        
                

        <!-- Cookie Axeptio -->
        
        <!-- End Cookie Axeptio -->
 
Hi, I have the same trouble with this pack.

Maybe I missed something?

Thanks in advance.

[Attachment 117893]
[Attachment 117894]
[Attachment 117895]
 
