Get the first few words from a long summary (plain string or HTML)

I want to get the first few words (100 or 200) from a long summary (plain string or HTML) using C#.

My requirement is to display a short description taken from a long summary of the content (the content can include HTML elements). This works for a plain string, but when the content is HTML, elements get cut off in the middle. For example, I get something like this:

<span style="FONT-FAMILY: Trebuchet MS">Heading</span>
</H3><span style="FONT-FAMILY: Trebuchet MS">
<font style="FONT-SIZE: 15px;

      

But it should return a string with the complete html element.

I use a Yahoo UI editor to get content from the user, and I pass that text to the method below to get a short summary:

public static string GetFirstFewWords(string input, int numberWords)
{
    if (input.Split(new char[] { ' ' },
            StringSplitOptions.RemoveEmptyEntries).Length > numberWords)
    {
        // Number of words we still want to display.
        int words = numberWords;
        // Loop through entire summary.
        for (int i = 0; i < input.Length; i++)
        {
            // Decrement the remaining word count on each space.
            if (input[i] == ' ')
            {
                words--;
            }
            // If we have no more words to display, return the substring.
            if (words == 0)
            {
                return input.Substring(0, i);
            }
        }
        return string.Empty;
    }
    else
    {
        return input;
    }
}

      

I am trying to get the content of an article from a user and display a short summary on a list page.

+2




3 answers


I think the Html Agility Pack is your best bet.

While not ideal, here's one idea that will get (more or less) what you want:



// requires: using System; using System.Text; using HtmlAgilityPack;

// Retrieve a summary of the html made of whole top-level elements.
// Elements are added until at least 'max' words have been collected,
// so the result can run past 'max' by part of the last element.
string GetSummary(string html, int max)
{
    var summaryHtml = new StringBuilder();

    // load our html document
    HtmlDocument htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(html);

    int wordCount = 0;

    foreach (var element in htmlDoc.DocumentNode.ChildNodes)
    {
        // stop once we have collected enough words
        if (wordCount >= max)
        {
            break;
        }

        // inner text will strip out all html, and give us plain text
        string elementText = element.InnerText;

        // we split on whitespace to get all the words in this element
        string[] elementWords = elementText.Split(
            new char[] { ' ', '\t', '\r', '\n' },
            StringSplitOptions.RemoveEmptyEntries);

        // add the *outer* HTML (which will have proper
        // html formatting for this fragment) to the summary
        summaryHtml.Append(element.OuterHtml);

        wordCount += elementWords.Length;
    }

    return summaryHtml.ToString();
}

      

+2




Two options:



  • Write the code to do it properly: count words while skipping HTML tags, and push each opening tag onto a stack (popping it when its closing tag appears). When you hit the word threshold, pop the remaining tags off the stack and append their closing tags to the end of the string.

    pro: full control and ability to get exactly N visible words.
    con: somewhat difficult to implement cleanly.

  • Cut the string at the word limit and then feed the broken HTML into HtmlAgilityPack (a free download that can help fix broken HTML), and there you go.

    pro: almost no coding, proven solution supported by
    con: you still need a way to avoid counting tags as words when you call .Substring()
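
The first option could be sketched roughly as below. This is a minimal sketch under some assumptions, not a full HTML parser: `HtmlTruncator`, `TruncateWords` and `TrackTag` are hypothetical names, every tag is treated as a word boundary, the list of void elements is only a sample, and a `>` inside an attribute value or a comment is not handled.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class HtmlTruncator
{
    // Sample of elements that never take a closing tag (assumption: not exhaustive).
    private static readonly HashSet<string> VoidTags = new HashSet<string>(
        StringComparer.OrdinalIgnoreCase) { "br", "hr", "img", "input", "meta", "link" };

    // Keeps at most maxWords visible words and closes any tags left open.
    public static string TruncateWords(string html, int maxWords)
    {
        if (string.IsNullOrEmpty(html) || maxWords <= 0) return string.Empty;

        var output = new StringBuilder();
        var openTags = new Stack<string>();
        int words = 0;
        bool inWord = false;
        bool truncated = false;
        int i = 0;

        while (i < html.Length)
        {
            char c = html[i];
            if (c == '<')
            {
                // Copy the whole tag without counting it as text.
                int end = html.IndexOf('>', i);
                if (end < 0) break; // unterminated tag: drop the rest
                string tag = html.Substring(i, end - i + 1);
                TrackTag(tag, openTags);
                output.Append(tag);
                inWord = false;
                i = end + 1;
                continue;
            }
            if (char.IsWhiteSpace(c))
            {
                inWord = false;
            }
            else if (!inWord)
            {
                // A new word starts here; stop if the limit is already reached.
                if (words == maxWords) { truncated = true; break; }
                words++;
                inWord = true;
            }
            output.Append(c);
            i++;
        }

        // Remove the whitespace left behind after the last kept word.
        while (truncated && output.Length > 0 && char.IsWhiteSpace(output[output.Length - 1]))
            output.Length--;

        // Close whatever is still open, innermost element first.
        while (openTags.Count > 0)
            output.Append("</").Append(openTags.Pop()).Append('>');

        return output.ToString();
    }

    // Pushes opening tags and pops matching closing tags.
    private static void TrackTag(string tag, Stack<string> openTags)
    {
        if (tag.StartsWith("<!", StringComparison.Ordinal) ||
            tag.StartsWith("<?", StringComparison.Ordinal)) return; // comment, doctype

        bool closing = tag.StartsWith("</", StringComparison.Ordinal);
        int start = closing ? 2 : 1;
        int end = start;
        while (end < tag.Length && char.IsLetterOrDigit(tag[end])) end++;
        string name = tag.Substring(start, end - start);
        if (name.Length == 0) return;

        if (closing)
        {
            if (openTags.Count > 0 &&
                string.Equals(openTags.Peek(), name, StringComparison.OrdinalIgnoreCase))
                openTags.Pop();
        }
        else if (!tag.EndsWith("/>", StringComparison.Ordinal) && !VoidTags.Contains(name))
        {
            openTags.Push(name);
        }
    }
}
```

For example, `HtmlTruncator.TruncateWords("<div><p>one <i>two three</i></p></div>", 2)` gives `<div><p>one <i>two</i></p></div>`. Because tags count as word boundaries here, something like `f<b>oo</b>` is counted as two words; whether that matters depends on the content your editor produces.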

+2




You should keep your content and markup separate. Can you provide more information on what you are trying to do (for example, where this string comes from and why you need to do this)?

0








