Eric White's Blog

Retrieving the Default Style Name of an Open XML WordprocessingML Document


[Blog Map]  This blog is inactive.  New blog: EricWhite.com/blog

This is one in a series of posts on transforming Open XML WordprocessingML to XHtml.  You can find the complete list of posts here.

Whenever I write Open XML SDK code that processes paragraphs based on style name, I need to retrieve the default style name for the document.  It is pretty easy to do, but it always takes a small bit of time to remember or look up the element and attribute names.  I'm posting this code here so that I can save time the next time I need to do this.

This code uses the utility types (preatomized XName objects and the GetXDocument extension method) in PtOpenXmlUtil.cs, which you can find in HtmlConverter.zip under the downloads section at CodePlex.com/PowerTools.
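For reference, the query below matches a style definition in the styles part that looks something like the following (the element and attribute names come from the WordprocessingML schema; the styleId value varies by document):

```xml
<w:style w:type="paragraph" w:default="1" w:styleId="Normal">
  <w:name w:val="Normal"/>
  <w:qFormat/>
</w:style>
```

The code selects the w:styleId attribute of the first w:style element whose w:type attribute is "paragraph" and whose w:default attribute is "1".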

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;
using OpenXmlPowerTools;

class Program
{
    static void Main(string[] args)
    {
        byte[] byteArray = File.ReadAllBytes("Test.docx");
        using (MemoryStream memoryStream = new MemoryStream())
        {
            memoryStream.Write(byteArray, 0, byteArray.Length);
            using (WordprocessingDocument wordDoc =
                WordprocessingDocument.Open(memoryStream, true))
            {
                string defaultParagraphStyleId = wordDoc
                    .MainDocumentPart
                    .StyleDefinitionsPart
                    .GetXDocument()
                    .Root
                    .Elements(W.style)
                    .Where(e => (string)e.Attribute(W.type) == "paragraph" &&
                        (string)e.Attribute(W._default) == "1")
                    .Select(s => (string)s.Attribute(W.styleId))
                    .FirstOrDefault();
                Console.WriteLine(defaultParagraphStyleId);
            }
        }
    }
}


ListItemRetriever: Accurately Retrieving Text of an Open XML WordprocessingML Paragraph


This is one in a series of posts on transforming Open XML WordprocessingML to XHtml.  You can find the complete list of posts here.

When you are retrieving the text of an Open XML WordprocessingML paragraph, it is often pretty important to retrieve the text of a list item.  This was especially true for the WordprocessingML => XHtml transform.  This post introduces the ListItemRetriever class, which implements one aspect of the functionality in HtmlConverter to retrieve the entire text of a paragraph accurately.  ListItemRetriever is stand-alone – you can use it in a variety of contexts other than HtmlConverter.

It is important to understand exactly what a list item is.  Numbered and bulleted lists are made up of two components – the list item and the paragraph text.  This distinction applies to both bulleted and numbered lists.

The blog post, Working with Numbering in Open XML WordprocessingML, describes the markup for list items.  That blog post has also been published as an MSDN article, which you can find here.  The ListItemRetriever class implements the algorithms that are necessary to process the markup described in that post.

The use of ListItemRetriever is super simple.  The class consists of one public static method with the following signature:

public static string RetrieveListItem(WordprocessingDocument wordDoc,
    XElement paragraph, string bulletReplacementString)

To retrieve a string that contains the list item text, you pass an open WordprocessingML document, a LINQ to XML XElement for the paragraph, and an optional bullet replacement string.  Following is an example that shows the use of ListItemRetriever.RetrieveListItem:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml.Linq;
using OpenXmlPowerTools;
using DocumentFormat.OpenXml.Packaging;

class Program
{
    static void OutputBlockLevelContent(WordprocessingDocument wordDoc,
        XElement blockLevelContentContainer)
    {
        foreach (XElement blockLevelContentElement in
            blockLevelContentContainer.LogicalChildrenContent())
        {
            if (blockLevelContentElement.Name == W.p)
            {
                string listItem = ListItemRetriever.RetrieveListItem(
                    wordDoc, blockLevelContentElement, "-");
                string text = blockLevelContentElement
                    .LogicalChildrenContent(W.r)
                    .LogicalChildrenContent(W.t)
                    .Select(t => (string)t)
                    .StringConcatenate();
                Console.WriteLine("Paragraph text >{0}<", listItem + text);
                continue;
            }
            // If element is not a paragraph, it must be a table.
            Console.WriteLine();
            Console.WriteLine("Table");
            Console.WriteLine("=====");
            foreach (var row in blockLevelContentElement.LogicalChildrenContent())
            {
                Console.WriteLine();
                Console.WriteLine("Row");
                Console.WriteLine("===");
                foreach (var cell in row.LogicalChildrenContent())
                {
                    Console.WriteLine();
                    Console.WriteLine("Cell");
                    Console.WriteLine("====");
                    // Cells are a block-level content container, so can call this
                    // method recursively.
                    OutputBlockLevelContent(wordDoc, cell);
                    Console.WriteLine();
                }
            }
        }
    }

    static void Main(string[] args)
    {
        byte[] byteArray = File.ReadAllBytes("Test.docx");
        using (MemoryStream memoryStream = new MemoryStream())
        {
            memoryStream.Write(byteArray, 0, byteArray.Length);
            using (WordprocessingDocument wordDoc =
                WordprocessingDocument.Open(memoryStream, true))
            {
                RevisionAccepter.AcceptRevisions(wordDoc);
                XElement root = wordDoc.MainDocumentPart.GetXDocument().Root;
                XElement body = root.LogicalChildrenContent().First();
                OutputBlockLevelContent(wordDoc, body);
            }
        }
    }
}

The following document contains a number of paragraphs that are part of numbered or bulleted lists.  It includes a table with one row and two cells:

When you run the example program for this document, you see:

Paragraph text >First. This is a test.<
Paragraph text >Second. This is another paragraph.<
Paragraph text >Third. Third paragraph is here.<
Paragraph text ><
Paragraph text >I. This list has roman numerals.<
Paragraph text >II. Et tu, Bruté?<
Paragraph text >III. This is a third paragraph.<
Paragraph text >IV. This is a fourth paragraph.<
Paragraph text >V. This is a fifth paragraph.<
Paragraph text ><

Table
=====

Row
===

Cell
====
Paragraph text >1st. This is a numbered list in a cell.<
Paragraph text >2nd. Another item.<
Paragraph text >3rd. A third item.<


Cell
====
Paragraph text >- Here is a bulleted list.<
Paragraph text >- Another item.<

Paragraph text ><

In this example, the OutputBlockLevelContent method needs to be recursive, as table cells can contain other tables.

In this example, I passed a hyphen (-) to the bulletReplacementString argument of ListItemRetriever.RetrieveListItem, so that it is easy to print the string.  Alternatively, you can pass null for the bulletReplacementString argument, in which case the returned string contains the Unicode characters for bullets.  There are a number of bullets that you can use for a bulleted list.

The ListItemRetriever depends on the document containing no tracked revisions.  This example uses the approach of reading a document into a byte array, then creating a resizable memory stream from the byte array, and then opening the WordprocessingML document from the memory stream.  Using this approach, the example is free to accept revisions, and subsequently process the document without affecting the document stored on disk.  I introduced this approach in Simplifying Open XML WordprocessingML Queries by First Accepting Revisions.
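If you use this pattern often, you can factor it into a small helper method.  The following is a sketch under the assumption that you always want an editable in-memory copy; OpenInMemory is a name I made up for this example, not part of the SDK:

```csharp
using System.IO;
using DocumentFormat.OpenXml.Packaging;

static class InMemoryOpener
{
    // Hypothetical helper: opens an editable in-memory copy of a document,
    // leaving the file on disk untouched.
    public static WordprocessingDocument OpenInMemory(string path)
    {
        byte[] bytes = File.ReadAllBytes(path);
        // Write into a resizable MemoryStream; a MemoryStream constructed
        // directly over the byte array would not be expandable.
        MemoryStream stream = new MemoryStream();
        stream.Write(bytes, 0, bytes.Length);
        return WordprocessingDocument.Open(stream, true);
    }
}
```

Any edits (such as accepting revisions) then happen only in the in-memory copy.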

This example uses the LogicalChildrenContent axis method, which I introduced in Mastering Text in Open XML Word-Processing Documents.  By first accepting revisions, and by using the LogicalChildrenContent axes, this example will return the correct text regardless of whether the source document contains revisions, content controls, smart tags, or any of the other interesting artifacts of WordprocessingML that can make processing WordprocessingML content more challenging.

The ListItemRetriever class is part of the PowerTools for Open XML project.  You can find ListItemRetriever.cs in the HtmlConverter.zip download, under the 'Downloads' tab at PowerTools for Open XML.

PowerTools for Open XML also includes the PtOpenXmlUtil.cs module, which includes the LogicalChildrenContent axis methods.

This example uses the StringConcatenate extension method.  The module PtUtil.cs in the PowerTools for Open XML project contains a number of my favorite functional programming extension methods, including StringConcatenate.

Transforming WordprocessingML to Simpler XML for Easier Processing


This is one in a series of posts on transforming Open XML WordprocessingML to XHtml.  You can find the complete list of posts here.

When building a document processing system based on Open XML WordprocessingML, one approach to making your software system more robust is to use a technique where you make use of WordprocessingML style information to first transform a WordprocessingML document to a simpler form of XML.  You then can write code to further process or query the simpler XML.  Because your code operates on the simpler XML document, it is easier to validate that your code is correct.  You stand a better chance that your software will have the desired behavior in more circumstances.  This post presents one approach to transforming and processing a WordprocessingML document.

The following screenshot uses a feature of Microsoft Word that displays the style name for every paragraph to the paragraph's left.  Your document might look like the following:

The WordprocessingML markup for the document looks like this:

<w:p w:rsidR="00DD5B8D"
     w:rsidRDefault="00D56D15"
     w:rsidP="00D56D15">
  <w:pPr>
    <w:pStyle w:val="Heading1"/>
  </w:pPr>
  <w:r>
    <w:t>Introduction to WordprocessingML</w:t>
  </w:r>
</w:p>
<w:sdt>
  <w:sdtPr>
    <w:alias w:val="Overview"/>
    <w:tag w:val="Overview"/>
    <w:id w:val="17452686"/>
    <w:placeholder>
      <w:docPart w:val="DefaultPlaceholder_22675703"/>
    </w:placeholder>
  </w:sdtPr>
  <w:sdtEndPr>
    <w:rPr>
      <w:rFonts w:asciiTheme="minorHAnsi"
                w:eastAsiaTheme="minorHAnsi"
                w:hAnsiTheme="minorHAnsi"
                w:cstheme="minorBidi"/>
      <w:b w:val="0"/>
      <w:bCs w:val="0"/>
      <w:color w:val="auto"/>
      <w:sz w:val="22"/>
      <w:szCs w:val="22"/>
    </w:rPr>
  </w:sdtEndPr>
  <w:sdtContent>
    <w:p w:rsidR="00D56D15"
         w:rsidRDefault="00D56D15"
         w:rsidP="00D56D15">
      <w:pPr>
        <w:pStyle w:val="Heading2"/>
      </w:pPr>
      <w:r>
        <w:t>Overview</w:t>
      </w:r>
    </w:p>
    <w:p w:rsidR="00D56D15"
         w:rsidRDefault="00D56D15">
      <w:r>
        <w:t>On the Insert tab, the galleries include items.</w:t>
      </w:r>
    </w:p>
  </w:sdtContent>
</w:sdt>
<w:sdt>
  <!-- content of the second content control elided to simplify the listing -->
</w:sdt>

You could transform this markup to something along the lines of the following, which is easier to further process.

<z:document xmlns:z="http://www.adventureworks.com/sample">
  <z:p style="Heading1">Introduction to WordprocessingML</z:p>
  <z:contentControl tag="Overview">
    <z:p style="Heading2">Overview</z:p>
    <z:p style="Normal">On the Insert tab, the galleries include items.</z:p>
  </z:contentControl>
  <z:contentControl tag="Section">
    <z:p style="Heading2">Next Section</z:p>
    <z:p style="Normal">You can use these galleries to insert tables.</z:p>
  </z:contentControl>
</z:document>

The first step in writing a transform to simpler XML is to accept tracked changes.  With these types of transforms, where you first transform valid WordprocessingML to another simpler form of valid WordprocessingML, you probably want to use a technique where you make all changes to a temporary in-memory document.  You probably do not want to write the simpler XML back to the original source document.  Simplifying Open XML WordprocessingML Queries by First Accepting Revisions presents the simplest approach for doing this.

The second step is to further process the document (which now contains no tracked changes), and remove features of Open XML that are not interesting to your transform.  Enabling Better Transformations by Simplifying Open XML WordprocessingML Markup introduces the MarkupSimplifier class, which is part of the PowerTools for Open XML project on CodePlex.  That class makes it very easy to reliably remove unused features of WordprocessingML.

One more important tool in your toolbox is to use the LogicalChildrenContent axis method.  Mastering Text in Open XML Word-Processing Documents introduces the LogicalChildrenContent axis method, and explains when you want to use it.

A transform from WordprocessingML to a simpler XML vocabulary is a document-centric transform.  The blog post Document-Centric Transforms using LINQ to XML explores the nature of these transforms.  XSLT is a common tool for writing them; it is designed with the express purpose of building such transforms.  Recursive Approach to Pure Functional Transformations of XML introduces one approach to writing these transforms using C# 3.0, which allows you to write an absolute minimum of C# code.

The following is a complete listing of the transform that produces the simpler XML shown at the beginning of this post.  You can see that it doesn't take much code to do the transform – less than 90 lines of code for the entire example, and only 31 lines of code for the transform function.  This example can be found in the HtmlConverter.zip download under the Downloads tab at PowerTools for Open XML.  This code will work properly regardless of whether the WordprocessingML markup contains revisions, smart tags, or any of the other interesting features of WordprocessingML.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;
using OpenXmlPowerTools;

class Program
{
    static XNamespace Z = "http://www.adventureworks.com/sample";

    static object TransformToSimpleXml(XNode node, string defaultParagraphStyleId)
    {
        XElement element = node as XElement;
        if (element != null)
        {
            if (element.Name == W.document)
                return new XElement(Z + "document",
                    new XAttribute(XNamespace.Xmlns + "z", Z),
                    element.Element(W.body).Elements()
                        .Select(e => TransformToSimpleXml(e, defaultParagraphStyleId)));
            if (element.Name == W.p)
            {
                string styleId = (string)element.Elements(W.pPr)
                    .Elements(W.pStyle).Attributes(W.val).FirstOrDefault();
                if (styleId == null)
                    styleId = defaultParagraphStyleId;
                return new XElement(Z + "p",
                    new XAttribute("style", styleId),
                    element.LogicalChildrenContent(W.r).Elements(W.t).Select(t => (string)t)
                        .StringConcatenate());
            }
            if (element.Name == W.sdt)
                return new XElement(Z + "contentControl",
                    new XAttribute("tag", (string)element.Elements(W.sdtPr)
                        .Elements(W.tag).Attributes(W.val).FirstOrDefault()),
                    element.Elements(W.sdtContent).Elements()
                        .Select(e => TransformToSimpleXml(e, defaultParagraphStyleId)));
            return null;
        }
        return node;
    }

    static void Main(string[] args)
    {
        byte[] byteArray = File.ReadAllBytes("Test.docx");
        using (MemoryStream memoryStream = new MemoryStream())
        {
            memoryStream.Write(byteArray, 0, byteArray.Length);
            using (WordprocessingDocument wordDoc =
                WordprocessingDocument.Open(memoryStream, true))
            {
                RevisionAccepter.AcceptRevisions(wordDoc);
                SimplifyMarkupSettings settings = new SimplifyMarkupSettings
                {
                    RemoveComments = true,
                    RemoveContentControls = false,
                    RemoveEndAndFootNotes = true,
                    RemoveFieldCodes = true,
                    RemoveLastRenderedPageBreak = true,
                    RemovePermissions = true,
                    RemoveProof = true,
                    RemoveRsidInfo = true,
                    RemoveSmartTags = true,
                    RemoveSoftHyphens = true,
                    ReplaceTabsWithSpaces = true,
                };
                MarkupSimplifier.SimplifyMarkup(wordDoc, settings);
                string defaultParagraphStyleId = wordDoc.MainDocumentPart
                    .StyleDefinitionsPart.GetXDocument().Root.Elements(W.style)
                    .Where(e => (string)e.Attribute(W.type) == "paragraph" &&
                        (string)e.Attribute(W._default) == "1")
                    .Select(s => (string)s.Attribute(W.styleId))
                    .FirstOrDefault();
                XElement simplerXml = (XElement)TransformToSimpleXml(
                    wordDoc.MainDocumentPart.GetXDocument().Root,
                    defaultParagraphStyleId);
                Console.WriteLine(simplerXml);
            }
        }
    }
}

Validate Open XML Documents using the Open XML SDK 2.0


Open XML developers create new documents in a variety of ways – either through transforming from an existing document to a new one, or by programmatically altering an existing document and saving it back to disk.  It is valuable to use the Open XML SDK 2.0 to determine if the new or altered document, spreadsheet, or presentation contains invalid markup.

This was particularly useful when I was writing the code to accept tracked revisions, and the Open XML WordprocessingML markup simplifier.  I wrote a small program to iterate through all documents in a directory tree and programmatically alter or transform each document, and then validate.  This allowed me to run the code on thousands of documents, making sure that the code would not create invalid documents.

The use of the validator is simple:

  • Open your document/spreadsheet/presentation as usual using the Open XML SDK.
  • Instantiate an OpenXmlValidator object (from the DocumentFormat.OpenXml.Validation namespace).
  • Call the OpenXmlValidator.Validate method, passing the open document.  This method returns a collection of ValidationErrorInfo objects.  If the collection is empty, then the document is valid.  You can validate before and after modifying the document.

Here is the simplest code to validate a document.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Validation;
using DocumentFormat.OpenXml.Wordprocessing;

class Program
{
    static void Main(string[] args)
    {
        using (WordprocessingDocument wordDoc =
            WordprocessingDocument.Open("Test.docx", false))
        {
            OpenXmlValidator validator = new OpenXmlValidator();
            var errors = validator.Validate(wordDoc);
            if (errors.Count() == 0)
                Console.WriteLine("Document is valid");
            else
                Console.WriteLine("Document is not valid");
        }
    }
}

While debugging your code, it is helpful to know exactly where each error is.  You can iterate through the errors, printing:

  • The content type for the part that contains the error.
  • An XPath expression that identifies the element that caused the error.
  • An error message.

Here is code to do that:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Validation;
using DocumentFormat.OpenXml.Wordprocessing;

class Program
{
    static void Main(string[] args)
    {
        using (WordprocessingDocument wordDoc =
            WordprocessingDocument.Open("Test.docx", false))
        {
            OpenXmlValidator validator = new OpenXmlValidator();
            var errors = validator.Validate(wordDoc);
            if (errors.Count() == 0)
                Console.WriteLine("Document is valid");
            else
                Console.WriteLine("Document is not valid");
            Console.WriteLine();
            foreach (var error in errors)
            {
                Console.WriteLine("Error description: {0}", error.Description);
                Console.WriteLine("Content type of part with error: {0}",
                    error.Part.ContentType);
                Console.WriteLine("Location of error: {0}", error.Path.XPath);
            }
        }
    }
}

As a developer, you will want to open a document, modify it in some fashion, and then validate that your modifications were correct.  The following example opens a document for writing, modifies it to make it invalid, and then validates.  To make an invalid document, it adds a text element (w:t) as a child element of a paragraph (w:p) instead of a run (w:r).

This approach to document validation works if you are using the Open XML SDK strongly-typed object model.  It also works if you are using another XML programming technology, such as LINQ to XML.  The following example shows the document modification code written using two approaches.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Validation;
using DocumentFormat.OpenXml.Wordprocessing;

public static class MyExtensions
{
    public static XDocument GetXDocument(this OpenXmlPart part)
    {
        XDocument partXDocument = part.Annotation<XDocument>();
        if (partXDocument != null)
            return partXDocument;
        using (Stream partStream = part.GetStream())
        using (XmlReader partXmlReader = XmlReader.Create(partStream))
            partXDocument = XDocument.Load(partXmlReader);
        part.AddAnnotation(partXDocument);
        return partXDocument;
    }

    public static void PutXDocument(this OpenXmlPart part)
    {
        XDocument partXDocument = part.GetXDocument();
        if (partXDocument != null)
        {
            using (Stream partStream = part.GetStream(FileMode.Create, FileAccess.Write))
            using (XmlWriter partXmlWriter = XmlWriter.Create(partStream))
                partXDocument.Save(partXmlWriter);
        }
    }
}

class Program
{
    static void Main(string[] args)
    {
        using (WordprocessingDocument wordDoc =
            WordprocessingDocument.Open("Test.docx", true))
        {
            // Open XML SDK strongly-typed object model code that modifies a document,
            // making it invalid.
            wordDoc.MainDocumentPart.Document.Body.InsertAt(
                new Paragraph(
                    new Text("Test")), 0);

            // LINQ to XML code that modifies a document, making it invalid.
            XDocument d = wordDoc.MainDocumentPart.GetXDocument();
            XNamespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
            d.Descendants(w + "body").First().AddFirst(
                new XElement(w + "p",
                    new XElement(w + "t", "Test")));
            wordDoc.MainDocumentPart.PutXDocument();

            OpenXmlValidator validator = new OpenXmlValidator();
            var errors = validator.Validate(wordDoc);
            if (errors.Count() == 0)
                Console.WriteLine("Document is valid");
            else
                Console.WriteLine("Document is not valid");
            Console.WriteLine();
            foreach (var error in errors)
            {
                Console.WriteLine("Error description: {0}", error.Description);
                Console.WriteLine("Content type of part with error: {0}",
                    error.Part.ContentType);
                Console.WriteLine("Location of error: {0}", error.Path.XPath);
            }
        }
    }
}

When you run this example, it produces the following output:


Document is not valid

Error description: The element has invalid child element
  'http://schemas.openxmlformats.org/wordprocessingml/2006/main:t'.
  List of possible elements expected:
    <http://schemas.openxmlformats.org/wordprocessingml/2006/main:pPr>.
Content type of part with error:
  application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml
Location of error: /w:document[1]/w:body[1]/w:p[1]

Reducing Connascence (Interconnectedness) and Increasing Robustness using LINQ


This is one in a series of posts on transforming Open XML WordprocessingML to XHtml.  You can find the complete list of posts here.

Recently I went through the process of handling a variety of details in the XHtml converter – generating XHtml entities wherever possible, and making sure that, above all else, the converter doesn't throw exceptions, regardless of the documents that you throw at it (including invalid documents, to a certain extent).  As I was doing so, I touched various pieces of the code here and there, and I was struck by a couple of interesting (and good) dynamics of maintaining the code.  First, changes to the code are localized; second, it's easy to write resilient code.

First, about the code: there are around 3500 lines of code in six modules.  The code is written in the pure functional style.  Variables and class members are not mutated after creation/initialization.  If a variable or object is in scope, we can depend on its value not changing.

This means that for the most part, all of my code changes were local.  This only makes sense – if you are writing pure functions/methods, then by definition, you will not be touching data outside the function/method.

The WordprocessingML => XHtml converter is written as a series of successive transformations.  So long as I continue to produce intermediate results that are consistent with the design, I'm free to modify code as much as I like.  The code is malleable, not brittle.  I can as necessary validate that the transform is producing valid Open XML documents.

In Meilir Page-Jones's books on object-oriented design, he coins the term connascence for the notion of interconnectedness of code.  For example, a class has a member function that is used in a variety of other modules; a change in the name or signature of the function requires all uses of it to be updated.  This is a variety of interconnectedness that the compiler catches.  Another example: a class has a member with some odd-ball semantics, and another class relies on those odd-ball semantics; if you change the behavior of the method without changing its signature, the compiler won't notice.  Magic values are a great example of a horrible form of connascence.

Magic values are a design 'feature' of some programming interfaces where values from two value domains are used in the same variable or collection.  For example, you could have an array of integers where the integers 0-9999 are indexes into a data structure, but -1 and -2 have special meaning.  It is better to split the semantic information into multiple fields of a named or anonymous type.  If a value is an index, then it should always be an index.
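To make this concrete, here is a small invented illustration (the names are made up for this example, not taken from the converter code): a lookup that signals "not found" with a magic sentinel value, contrasted with one that splits the two value domains into separate fields of a type:

```csharp
using System;

class MagicValueExample
{
    // Magic-value style: -1 means "not found"; anything else is an index.
    // The compiler cannot tell the two value domains apart.
    static int FindMagic(int[] data, int value)
    {
        for (int i = 0; i < data.Length; i++)
            if (data[i] == value) return i;
        return -1; // magic value
    }

    // Splitting the semantics into separate fields: an index is always
    // an index, and "not found" is carried by a distinct field.
    struct LookupResult
    {
        public bool Found;
        public int Index;
    }

    static LookupResult Find(int[] data, int value)
    {
        for (int i = 0; i < data.Length; i++)
            if (data[i] == value)
                return new LookupResult { Found = true, Index = i };
        return new LookupResult { Found = false, Index = 0 };
    }

    static void Main()
    {
        int[] data = { 10, 20, 30 };
        Console.WriteLine(FindMagic(data, 20)); // 1
        Console.WriteLine(FindMagic(data, 99)); // -1 -- an index, or a sentinel?
        Console.WriteLine(Find(data, 99).Found); // False
    }
}
```

With the second shape, a caller that forgets to check Found is much easier to spot in review than one that forgets to compare against -1.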

In a couple of places in the XHtml transform I have written code that is locally impure but globally pure, because the code is (much) more readable when written that way.  One good example is accepting tracked revisions for content controls.  The semantics of this transform are pretty involved – a block-level content control may need to be demoted to a run-level content control when processing a tracked deleted paragraph mark.  Pure functional code would require that the revised content control be created conditionally at both the block level and the run level, and the content of each conditionally created content control would in turn need to be created conditionally.  Due to the involved semantics, the resulting pure functional code would be a mess.  When I make the decision to write locally impure code like that, I don't take it lightly, and I make sure that I can articulate why I'm writing imperative (not declarative) code.  There are times and places when it's appropriate to go to any length to write pure functional code without side effects – if you must take advantage of multiple cores in specific ways, or if you are writing a system that can relink while running.  However, when programming with Open XML, these reasons don't apply.  For one thing, I've yet to build an Open XML processing system that isn't IO bound, so using multiple cores won't help.  And the second scenario applies to telephone switching systems (think Erlang).  Instead, I use functional code to reduce line count (by an order of magnitude in some cases) and to make debugging easier.

Benefits of Pure Code (Functions with no Side-Effects)

There are a very few places in the XHtml conversion where interconnected code crosses module boundaries.  For example, two classes (MarkupSimplifier and HtmlConverter) each have a settings class (MarkupSimplifierSettings and HtmlConverterSettings).  If I change a member in one of these classes, then I probably need to touch other modules.  But these cases are mitigated in that the compiler catches the issues.  In addition, in a few places in the code, a query projects a collection of some type, and other transformations take that collection and transform it into other shapes.  If I change the definition of that type, then I need to touch other pieces of code.  These changes are relatively local, and again, the compiler will catch problems (so long as you don't use magic values).

In code that is written in a functional style, situations where the compiler will not catch issues of interconnectedness are vastly reduced.  In many cases, a transform takes a valid WordprocessingML document and produces another valid WordprocessingML document (which is easy to validate).  So long as I've satisfied those requirements, change happens locally, and the code is amenable to change.

Using LINQ to write Resilient Code

Before discussing this issue, I want to articulate a couple of goals of my WordprocessingML => XHtml converter.

  • First, as much as reasonably possible, it shouldn't throw exceptions, even if you give the converter an invalid document.

  • Second, generate *some* XHtml that is reasonable for the document.  Of course, the primary goal is to generate accurate XHtml, but if there is some case that I haven't handled, or if you send an invalid document through the converter, then at least generate something reasonable.

I've tested the code on a fairly wide variety of documents, including many that were generated by applications other than Word.  And sure enough, some of those are invalid.  In many cases, Word is pretty good at opening invalid documents, and one of my goals is to make the XHtml converter at least as accepting as Word.

To meet this goal, wherever possible, I wrote code using the approach that I detailed in Querying for Optional Elements and Attributes.  Sure, it adds a small bit of friction, but the resulting robustness makes it worth it.
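As a minimal standalone sketch of that querying pattern (the class and method names here are mine, for illustration only): `Elements()` and `Attributes()` return empty sequences for missing nodes, and casting a null `XAttribute` to `string` yields null rather than throwing, so the chain stays safe even on sparse or invalid markup.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class OptionalQuery
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the paragraph style id, or null if pPr, pStyle, or val is absent.
    public static string GetStyleId(XElement paragraph)
    {
        // No step in this chain throws when an element or attribute is
        // missing; the cast from a null XAttribute to string yields null.
        return (string)paragraph.Elements(W + "pPr")
            .Elements(W + "pStyle")
            .Attributes(W + "val")
            .FirstOrDefault();
    }

    public static void Main()
    {
        XElement styled = XElement.Parse(
            "<w:p xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>" +
            "<w:pPr><w:pStyle w:val='Heading1'/></w:pPr></w:p>");
        XElement bare = XElement.Parse(
            "<w:p xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'/>");
        Console.WriteLine(GetStyleId(styled));
        Console.WriteLine(GetStyleId(bare) ?? "(none)");
    }
}
```

The same shape works for any optional element/attribute chain in WordprocessingML markup.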

Testing the WordprocessingML => XHtml Converter

To test the WordprocessingML to XHtml converter, first I put together a set of test documents that cover all of the cases that I want to handle.  Those test documents achieve code coverage over all lines of code.  Once that set of documents could be converted to XHtml to my satisfaction, I ran the code over my collection of Open XML WordprocessingML documents.  Over time, I’ve assembled a fairly large collection of documents – around 25,000.  Running the code over this larger set of documents illuminated issues associated with edge-case documents and invalid documents.  After all my specifically designed documents were converted to XHtml properly, and after running the converter over the larger set of sample documents without encountering exceptions, I could release the code with a fair amount of confidence in its quality.
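A batch run over a large corpus can be sketched as a small harness that collects failures rather than stopping at the first exception, so one bad document doesn't hide problems in the rest.  This is a hypothetical sketch; the stand-in converter below simply rejects certain files, where a real run would call into the converter itself.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class BatchTester
{
    // Runs a converter delegate over every file, recording failures
    // instead of aborting, so a whole corpus can be surveyed in one pass.
    public static List<string> RunOver(IEnumerable<string> files,
        Action<string> convert)
    {
        var failures = new List<string>();
        foreach (string file in files)
        {
            try { convert(file); }
            catch (Exception e)
            {
                failures.Add(string.Format("{0}: {1}", file, e.Message));
            }
        }
        return failures;
    }

    public static void Main()
    {
        // Stand-in for the real converter: fails on "invalid" documents.
        Action<string> fakeConvert = f =>
        {
            if (f.Contains("invalid")) throw new InvalidDataException("bad part");
        };
        var failures = RunOver(
            new[] { "a.docx", "invalid1.docx", "b.docx" }, fakeConvert);
        foreach (var f in failures)
            Console.WriteLine(f);
    }
}
```

With a real converter plugged in, the returned failure list is exactly the set of edge-case documents worth investigating.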

Formats Supported for altChunk

The altChunk importing functionality of Word supports the following formats for the imported content:

AlternativeFormatImportPartType enum    Content Type
Html                                    text/html
Mht                                     message/rfc822
OfficeWordMacroEnabled                  application/vnd.ms-word.document.macroEnabled.main+xml
OfficeWordMacroEnabledTemplate          application/vnd.ms-word.template.macroEnabledTemplate.main+xml
OfficeWordTemplate                      application/vnd.openxmlformats-officedocument.wordprocessingml.template.main+xml
Rtf                                     application/rtf
TextPlain                               text/plain
WordprocessingML                        application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml
XHtml                                   application/xhtml+xml
Xml *                                   application/xml

* This content type imports Office 2003 Word XML format (schemas) and the Office 2007 flat OPC format.
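For quick reference in code, the enum-member-to-content-type mapping above can be captured as a lookup; the values are transcribed from the table, and the class name is mine.

```csharp
using System;
using System.Collections.Generic;

public static class AltChunkContentTypes
{
    // AlternativeFormatImportPartType enum member name -> content type
    // Word expects for the imported altChunk part.
    public static readonly IDictionary<string, string> Map =
        new Dictionary<string, string>
        {
            { "Html", "text/html" },
            { "Mht", "message/rfc822" },
            { "OfficeWordMacroEnabled",
              "application/vnd.ms-word.document.macroEnabled.main+xml" },
            { "OfficeWordMacroEnabledTemplate",
              "application/vnd.ms-word.template.macroEnabledTemplate.main+xml" },
            { "OfficeWordTemplate",
              "application/vnd.openxmlformats-officedocument.wordprocessingml.template.main+xml" },
            { "Rtf", "application/rtf" },
            { "TextPlain", "text/plain" },
            { "WordprocessingML",
              "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml" },
            { "XHtml", "application/xhtml+xml" },
            { "Xml", "application/xml" },
        };

    public static void Main()
    {
        Console.WriteLine(Map["Html"]);
    }
}
```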

For a basic example that shows how to use altChunk, see How to use altChunk for Document Assembly.

Another approach for assembling a document is to use DocumentBuilder.  See Comparison of altChunk to the DocumentBuilder Class for more info.

You can construct and insert rich content that contains images.  For more info, see Inserting Content that Contains Images using altChunk.

Release of the Open XML SDK 2.0 for Microsoft Office

Microsoft has released the RTM version of the Open XML SDK 2.0 for Microsoft Office today.  This is great news and another big step forward for developers who write software systems that read and write Open XML documents.  Until now, we’ve been working with Community Technology Previews (CTPs) that had licensing restrictions (you couldn’t ‘go live’ with solutions).  We now have a stable platform on which we can build the next generation of Open XML solutions.

As with the CTPs, the RTM version of the Open XML SDK consists of two principal components:

  • A .NET managed class library that provides capabilities for reading, writing, modifying, and validating Open XML documents.
  • A productivity tool that includes the ability to diff Open XML documents, a C# code generator, and tools to explore and read about the class library and the standard.

About the Library

Some of the key characteristics of the library are:

  • You can use a powerful functional programming approach to write applications that generate documents, spreadsheets, and presentations.
  • You can use Language Integrated Query (LINQ) to retrieve data and content from documents, spreadsheets, and presentations.
  • You can write code to open, modify, and save documents.
  • You can use validation functionality to be more certain that your documents conform to the IS29500 standard and will be able to be opened using Microsoft Office and other conforming applications.  Document formats, by their very nature, are involved.  The validation functionality in the Open XML SDK is a big help when writing real-world solutions.

About the Tool

Key features of the tool are:

  • You can compare two Open XML documents to see exact changes in their markup.  This is one of the best ways to learn about Open XML markup.  If you want to understand which elements and attributes represent a feature that you want to interact with, create a document without the feature, copy the document to a new document, modify the new document, and compare to the old.  After determining the elements and attributes that changed, you can research them in the Open XML specification.
  • You can build a document generation program with a minimum of effort.  You supply the tool with a sample document.  You can then generate C# code that will generate the entire document, a specific part, or a specific element with its child elements.  This code is generated in a style that takes advantage of ‘functional construction’.  By this, I mean that any element (or its descendant elements) can be generated in a single expression.  You don’t need to write multiple statements.  This ability to generate content in an expression instead of a statement means that you can use LINQ queries and projections to formulate new descendant content for an element.  It’s a powerful approach.
  • The ability to explore the Open XML specification, the implementation notes, and the Open XML SDK class hierarchy in the tool means that you have one integrated tool to do much of the work that is necessary to build sophisticated document generation systems.

For more information on the Open XML SDK, samples, snippets, and articles, go to the Open XML Developer Center.

Zeyad Rajabi’s post contains a number of links to videos and other resources.  Also, check out Zeyad’s session from the SharePoint conference.

Gray Knowlton posted about how the Open XML SDK contributes to interoperability.

Erika Ehrli Cabral has a great post that contains a list of a lot of new content available for the Open XML SDK.

When I first started working with the Open XML document formats, the tools and resources available to me were quite limited.  The only released library was System.IO.Packaging.  The Open XML SDK version 1.0 was still a CTP.  We’ve come a very long way since then.  Congratulations to the development team and to everyone else who worked on this excellent technology.

Developing with SharePoint 2010 Word Automation Services

There are some tasks that are difficult using the Open XML SDK, such as repagination, conversion to other document formats such as PDF, or updating of the table of contents, fields, and other dynamic content in documents.  Word Automation Services is a new feature of SharePoint 2010 that can help in these scenarios.  It is a shared service that provides unattended, server-side conversion of documents into other formats, as well as some other essential pieces of functionality.  It was designed from the outset to work on servers, and can process high volumes of documents in a reliable and predictable fashion.

I've co-authored a paper, Developing with SharePoint 2010 Word Automation Services, which is published on MSDN.


Modifying an Open XML Document in a SharePoint Document Library

On a fairly regular basis, I need to write an example that retrieves an Open XML document from a SharePoint document library, modifies the document, and saves the document back to the document library.  The correct approach is to use a CAML query to retrieve the document.  This post presents the minimum amount of code that uses the SharePoint object model to do this.

This code requires the Open XML SDK, so you will need to download and install it.  You need to add a reference to the assembly.  In addition, you need to add a reference to the WindowsBase assembly and the Microsoft.SharePoint assembly.

See the Open XML Developer Center for lots of information on building applications that work with Open XML documents.

When building console applications for SharePoint 2010, you must target the .NET 3.5 framework.  In addition, you must target ‘Any CPU’, not X86.  The post Developing with SharePoint 2010 Word Automation Services contains explicit instructions for targeting .NET 3.5 and Any CPU.

Here is the smallest C# console application to do this:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading;
using Microsoft.SharePoint;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

class Program
{
    static void Main(string[] args)
    {
        string siteUrl = "http://localhost";
        using (SPSite spSite = new SPSite(siteUrl))
        {
            Console.WriteLine("Querying for Test.docx");
            SPList list = spSite.RootWeb.Lists["Shared Documents"];
            SPQuery query = new SPQuery();
            query.ViewFields = @"<FieldRef Name='FileLeafRef' />";
            query.Query =
              @"<Where>
                  <Eq>
                    <FieldRef Name='FileLeafRef' />
                    <Value Type='Text'>Test.docx</Value>
                  </Eq>
                </Where>";
            SPListItemCollection collection = list.GetItems(query);
            if (collection.Count != 1)
            {
                Console.WriteLine("Test.docx not found");
                Environment.Exit(0);
            }
            Console.WriteLine("Opening");
            SPFile file = collection[0].File;
            byte[] byteArray = file.OpenBinary();
            using (MemoryStream memStr = new MemoryStream())
            {
                memStr.Write(byteArray, 0, byteArray.Length);
                using (WordprocessingDocument wordDoc =
                    WordprocessingDocument.Open(memStr, true))
                {
                    Document document = wordDoc.MainDocumentPart.Document;
                    Paragraph firstParagraph = document.Body.Elements<Paragraph>()
                        .FirstOrDefault();
                    if (firstParagraph != null)
                    {
                        Paragraph testParagraph = new Paragraph(
                            new Run(
                                new Text("Test")));
                        firstParagraph.Parent.InsertBefore(testParagraph,
                            firstParagraph);
                    }
                }
                Console.WriteLine("Saving");
                string linkFileName = file.Item["LinkFilename"] as string;
                file.ParentFolder.Files.Add(linkFileName, memStr, true);
            }
        }
    }
}

Here is the same example in VB.  One thing that is cool about VB is that you can use XML literals to write the CAML query, and then call ToString() to set the Query field of the SPQuery object.

Imports System.IO
Imports System.Threading
Imports Microsoft.SharePoint
Imports DocumentFormat.OpenXml.Packaging
Imports DocumentFormat.OpenXml.Wordprocessing

Module Module1
    Sub Main()
        Dim siteUrl As String = "http://localhost"
        Using spSite As SPSite = New SPSite(siteUrl)
            Console.WriteLine("Querying for Test.docx")
            Dim list As SPList = spSite.RootWeb.Lists("Shared Documents")
            Dim query As SPQuery = New SPQuery()
            query.ViewFields = "<FieldRef Name='FileLeafRef' />"
            query.Query = ( _
               <Where>
                   <Eq>
                       <FieldRef Name='FileLeafRef'/>
                       <Value Type='Text'>Test.docx</Value>
                   </Eq>
               </Where>).ToString()
            Dim collection As SPListItemCollection = list.GetItems(query)
            If collection.Count <> 1 Then
                Console.WriteLine("Test.docx not found")
                Environment.Exit(0)
            End If
            Console.WriteLine("Opening")
            Dim file As SPFile = collection(0).File
            Dim byteArray As Byte() = file.OpenBinary()
            Using memStr As MemoryStream = New MemoryStream()
                memStr.Write(byteArray, 0, byteArray.Length)
                Using wordDoc As WordprocessingDocument = _
                    WordprocessingDocument.Open(memStr, True)
                    Dim document As Document = wordDoc.MainDocumentPart.Document
                    Dim firstParagraph As Paragraph = _
                        document.Body.Elements(Of Paragraph)().FirstOrDefault()
                    If firstParagraph IsNot Nothing Then
                        Dim testParagraph As Paragraph = New Paragraph( _
                            New Run( _
                                New Text("Test")))
                        firstParagraph.Parent.InsertBefore(testParagraph, _
                                                           firstParagraph)
                    End If
                End Using
                Console.WriteLine("Saving")
                Dim linkFileName As String = file.Item("LinkFilename")
                file.ParentFolder.Files.Add(linkFileName, memStr, True)
            End Using
        End Using
    End Sub
End Module

Assembling Documents on SharePoint 2010 Sites by Merging Content from Excel, PowerPoint, and Word

Zeyad Rajabi and Frank Rice have put together a cool article, Assembling Documents on SharePoint 2010 Sites by Merging Content from Excel, PowerPoint, and Word, that shows how to build an innovative document generation solution on SharePoint using the Open XML SDK 2.0 where you merge content from Excel and PowerPoint to create an interesting Word document.  The article uses the new ‘Document Sets’ feature of SharePoint 2010, and uses content controls in a template Word document to configure the assembly.  It contains code to import a chart and table from a spreadsheet to the Word document, and import SmartArt from a presentation to the Word document.  The article walks step-by-step through the process of creating a web part that the user uses to initiate document assembly.

Testing for Base Styles in Open XML WordprocessingML Documents

Sometimes you want to process all paragraphs in a document, and filter based on the style name.  However, sometimes it isn’t good enough to just filter on the style name, because if another style inherits from the style of interest, you want to include it in your processing.  A common example is that you want to process all paragraphs that are styled as “Heading1”, and all paragraphs that have a style that is based on “Heading1”.

The following example shows how to iterate through a document, and test whether each paragraph is styled as “Heading1” or a style derived from it.  The StyleChainReverseOrder method is an iterator that returns a collection of styles in the style chain; however, the styles in the returned collection are in most-derived to least-derived order.  The StyleChain method returns the collection from least-derived to most-derived, which most often is the order that you want to process them.  In this case, it doesn't really matter, but in others it does, so I included both methods in this post.  The interesting method in this post is the IsStyleBasedOnStyle method.  You can pass two style names to the IsStyleBasedOnStyle method, and it returns true if the style in the styleId argument is based on the style in the basedOnStyleId argument.  This example uses the strongly-typed OM, but it's pretty easy to convert to use LINQ to XML if that is what you are using.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

class Program
{
    static IEnumerable<Style> StyleChainReverseOrder(WordprocessingDocument doc,
        string styleId)
    {
        string current = styleId;
        while (true)
        {
            Style style = doc.MainDocumentPart.StyleDefinitionsPart.Styles
                .Elements<Style>().Where(s => s.StyleId == current)
                .FirstOrDefault();
            yield return style;
            if (style.BasedOn == null)
                yield break;
            current = style.BasedOn.Val;
        }
    }

    static IEnumerable<Style> StyleChain(WordprocessingDocument doc,
        string styleId)
    {
        return StyleChainReverseOrder(doc, styleId).Reverse();
    }

    static bool IsStyleBasedOnStyle(WordprocessingDocument doc, string styleId,
        string basedOnStyleId)
    {
        return StyleChain(doc, styleId).Any(s => s.StyleId.Value == basedOnStyleId);
    }

    static void Main(string[] args)
    {
        using (WordprocessingDocument doc =
            WordprocessingDocument.Open("Test.docx", false))
        {
            int cnt = 1;
            foreach (var p in doc.MainDocumentPart.Document.Body.Elements<Paragraph>())
            {
                Console.WriteLine("Paragraph: {0}", cnt++);
                if (p.ParagraphProperties != null)
                {
                    Console.WriteLine("StyleID: {0}",
                        p.ParagraphProperties.ParagraphStyleId.Val);
                    foreach (var s in StyleChain(doc,
                        p.ParagraphProperties.ParagraphStyleId.Val))
                        Console.WriteLine("  {0}", s.StyleId.Value);
                    Console.WriteLine("  Is Based on Heading1: {0}",
                        IsStyleBasedOnStyle(doc,
                            p.ParagraphProperties.ParagraphStyleId.Val, "Heading1"));
                }
                Console.WriteLine();
            }
        }
    }
}

SharePoint 2010 and Office 2010 Videos from the SharePoint Conference 2009

We’ve published ten of the most popular videos from the SharePoint Conference.

SharePoint 2010 Videos

Overview of the SharePoint 2010 Developer Platform

This SharePoint Conference video provides code-based demos and a brief look at the major new features, tools, and hosting options in SharePoint 2010.

Developing with the New User Interface Features in SharePoint 2010

This SharePoint Conference video shows how to customize the Ribbon without taking your users out of the look and feel of Microsoft SharePoint 2010.

Visual Studio 2010 SharePoint 2010 Development Tools Overview

This SharePoint Conference video provides an overview of SharePoint 2010 development with Visual Studio 2010, including project and item templates.

Building Solutions with Business Connectivity Services by Using Visual Studio 2010

This SharePoint Conference video shows how to provide multiple ways to expose line-of-business data as Microsoft SharePoint 2010 external lists.

Developing with REST and LINQ in SharePoint 2010

This SharePoint Conference video shows how to use the Client APIs as a programming model for SharePoint lists that do not have to run on the server.

Office 2010 Videos

What's New in Office 2010 for Developers

This SharePoint Conference video talks about new platform improvements in Office 2010, compatibility, UI programmability, and server-side services.

Develop Advanced Access Web Databases and Publish to SharePoint

This SharePoint Conference video shows how to build codeless Microsoft Access 2010 Web databases that run on both the client and the server.

Excel and Excel Services: The Top 10 Features You Need to Know

This SharePoint Conference video shows how to use the new features in Microsoft Excel 2010 and Excel Services 2010 to create rich business solutions.

Customizing Office 2010 Backstage View and Ribbon

This SharePoint Conference video shows how the Microsoft Office 2010 Backstage view can be programmatically customized for your solution.

Deep Dive Open XML and the Open XML SDK

This SharePoint Conference video shows how to use Open XML to create and edit documents on the server without using COM-based automation.

Download the PowerPoint Files for these Sessions

Using Content Controls to give Semantic Meaning to Content in Open XML WordprocessingML Documents

A wide variety of business applications can take advantage of content controls to give semantic meaning to content in Open XML WordprocessingML documents.  However, most applications that can benefit from content controls fit into one of three broad categories:

  • Document generation systems that use a template document for configuration.
  • Content publishing systems that transform WordprocessingML to another document format.
  • Collaboration systems that extract data and content from word-processing documents.

We’ve written and published three MSDN articles that provide guidance around these three scenarios, as well as links to a number of resources to help get started.

Building Document Generation Systems from Templates with Word 2010 and Word 2007

Building Publishing Systems that Use Word 2010 or Word 2007

Using Open XML WordprocessingML Documents as Data Sources

Of course, these are not the only types of applications that can benefit from content controls.  I’ve used content controls for a number of other purposes, including using them to delineate code that you want to test using a test harness.  However, in speaking with a number of customers who use content controls, most uses fit into one of the three above categories.

Determining if an Open XML WordprocessingML Document contains Tracked Changes

Processing tracked changes (sometimes known as tracked revisions) is something important that you should fully understand when writing Open XML applications.  If you first accept all tracked revisions, your job of processing or transforming the WordprocessingML is made significantly easier.

I’ve written an MSDN article, Accepting Revisions in Open XML Word-Processing Documents, which details the semantics of the elements and attributes of WordprocessingML that hold tracked changes information.  Further, I’ve written code, in CodePlex.com/PowerTools, that implements tracked changes as detailed in the above article.  Go to the Downloads tab, and download RevisionAccepter.zip.

However, there are other scenarios where you want to process only documents that are guaranteed to have no tracked changes, and due to certain business requirements you do not want to automatically accept tracked changes.  You might have a SharePoint document library that contains no documents with tracked changes.  Before users add a document to that document library, you want them to consciously and intentionally address and accept all tracked revisions.  Accepting revisions as part of the process of checking the document into the document library would circumvent the people portion of this process, where you want each person to manually examine their documents and resolve any issues.

The MSDN article, Identifying Open XML Word-Processing Documents with Tracked Revisions, explains in detail how to determine whether a document contains tracked revisions, with samples in both Visual Basic and C#.
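The core of the check can be sketched as follows.  This is a minimal illustration, not the article's code: it inspects a part's XML for a handful of tracked-revision elements, and the element list here is deliberately partial (the MSDN article enumerates the complete set across all content parts).

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class RevisionDetector
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Partial list, for illustration only; see the MSDN article for the
    // full set of elements and attributes that carry tracked revisions.
    static readonly XName[] RevisionElements =
    {
        W + "ins", W + "del", W + "rPrChange", W + "pPrChange",
        W + "moveFrom", W + "moveTo",
    };

    // True if any descendant of the part's root is a revision element.
    public static bool HasTrackedRevisions(XElement partRoot)
    {
        return partRoot.Descendants()
            .Any(e => RevisionElements.Contains(e.Name));
    }

    public static void Main()
    {
        XElement body = XElement.Parse(
            "<w:body xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>" +
            "<w:p><w:ins w:id='1' w:author='Eric'><w:r><w:t>new</w:t></w:r></w:ins></w:p>" +
            "</w:body>");
        Console.WriteLine(HasTrackedRevisions(body));
    }
}
```

A real implementation would run this check over the main document part plus headers, footers, endnotes, and footnotes.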

Using the Open XML SDK from within a Managed Add-In

When you are writing code for an Office managed add-in, you can use the Open XML SDK to manipulate the current document in a whole variety of ways.  This is a very powerful technique, which in some circumstances can give your application much better performance.  This is the key point behind the paper that Anil Kumar, Ansari M, Sarang Datye, and Sunil Kumar wrote, Increasing Performance of Word Automation for large amount of data using Open Xml SDK.  In that article, the authors provide the code to do exactly this.  Once you understand the mechanics of using the Open XML SDK from within Office automation, it opens up a set of scenarios that would be very difficult to otherwise implement because of performance limitations.


Open XML Package Editor Power Tool for Visual Studio 2010

The VSTO team today announced the release of the new Open XML Package Editor for Visual Studio 2010!  This is an indispensable tool in every Open XML developer’s toolbox.

This Power Tool is a Visual Studio add-in that provides an easy way to manually edit Open XML documents.  Once you install the add-in, you can drag an Open XML document into Visual Studio, and browse through the parts, and open specific parts for editing in Visual Studio's XML editor.  Visual Studio doesn't keep the file open, so you can use it in the following scenarios:

  • You can use Word 2010 to create a file, save the file, and close Word.  You can then drag the document onto Visual Studio, and look at the markup that was created.  You can then open the doc again in Word, change it, close it, and then Visual Studio will tell you that the file has changed outside the editor.  You can tell VS to reload the doc, and see the changed markup.  Because the Open XML Package Editor doesn't keep the file open, you don't need to close the file in VS before opening it again in Word.
  • You can write a program to manipulate the document in C# or VB.  If you have the Open XML document open in VS, you can run the program, and VS will tell you that the file has changed, and you can reload to see the results of your modifications.
  • You can manually modify the XML, and open the file in Word to see if your manual modifications worked and did what you wanted.  You can then close the doc, tweak the XML, and open it in Word again.

This is a very convenient way to examine and edit the XML in Open XML docs!

Of course, Word does not indent the XML when serializing.  You can use the XML formatting option in the XML editor to make it easy to view and edit the XML.  I like to set an option in VS so that when you format the XML, it lines up the attributes.  This is a far easier way to see the XML when there are lots of attributes, or lots of namespaces.  To set this option, select Tools, Options on the menu, expand Text Editor, expand XML, click on Formatting, and set the option, "Align attributes each on a separate line".

To format an XML document, select Edit, Advanced, Format Document on the menu (or type ^e, d).

Download: Open XML Package Editor Power Tool for Visual Studio 2010

Important Note: You must exit Visual Studio 2010 before installing.  If you install while Visual Studio 2010 is running, the package editor will not work until you exit and restart Visual Studio.

Writing a Recursive Descent Parser using C# and LINQ

Recursive descent parsers are one of the easier types of parsers to implement.  Given a properly defined grammar, you write a class for each production in the grammar, and you write one fairly simple method in each class.  Each of those methods returns a ‘production’ based on the source productions (tokens) passed to it.  Once you understand the pattern for writing those methods, and how to translate the grammar to the pattern, it’s possible to code those methods just about as fast as you can type.  In those methods, you can identify errors and throw exceptions as necessary.  After parsing, you have a syntax tree (sometimes called a parse tree) that you can examine or modify.

This post is one in a series on using LINQ to write a recursive-descent parser for SpreadsheetML formulas.  You can find the complete list of posts here.

For a typical professional developer, there are lots of benefits to understanding grammars, recursive descent parsers, and syntax trees.  At one point in my process of learning C#, I read the specification of the C# language, which includes the grammar for the language.  Understanding the grammar, and correlating the grammar to the text verified that I understood each construct, and reduced the chance that I had any misconceptions about the exact semantics of the language.

Going through the process of writing a small recursive descent parser is very helpful too.  This makes sure that you fully understand how grammars are put together, and what they mean.

Finally, as a professional developer, you may come across situations where this information is useful.  Perhaps you need to develop a small domain-specific language (DSL).  There are a number of approaches to developing DSLs, each appropriate in their own situation, and there are a few situations where it is most appropriate to define a grammar and write a parser.

As an Open XML developer, understanding recursive descent parsers is useful.  Formulas in a spreadsheet are stored as strings.  Without parsing those strings, those formulas are opaque.  We can’t examine a formula programmatically and determine what it is doing or what cells it references.  Fortunately, with Open XML, we have a grammar for those formulas.  :-)

As I was contemplating implementing some functionality (searching for all formulas across a range of workbooks and worksheets that reference a specific cell), I started contemplating writing a recursive descent parser in C# and LINQ.  It promised to be a super-fun programming project.  There are some huge benefits that we gain by writing the parser using LINQ.  Many of those methods that we need to write in the production classes are significantly simplified by the use of LINQ.

Just to make it clear why we need this parser:

In order to examine references to cells in formulas, we must parse those formulas according to the grammar.

For instance, if we want to search for all references to cell A3, we may see the following formula in cell A1:

=A2+A3

If searching for references to A3, this formula should be a match.

In this same spreadsheet, there may be another cell with a name of “ZZZZA3”, and that name could be used in the formula:

=A2+ ZZZZA3

This formula should not be a match.

There are practical reasons for writing this code.  As an Open XML developer, I don’t like it that formulas in cells are opaque to me.  I want to be able to look inside them and process Open XML spreadsheets in new and interesting ways.

However, the most important reason to write these posts is for the sheer, absolute fun of it.  This whole project will be fun.  It will be fun explaining how recursive descent parsers work.  It will be fun to explore how LINQ makes it easy to write recursive descent parsers.  And it will be fun to be able to query Open XML spreadsheets in cool ways.
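The series builds a full recursive descent parser from the SpreadsheetML grammar; as a minimal standalone sketch of why token-level matching matters, the following illustrates the A3 vs. ZZZZA3 distinction above.  The class and method names are mine, and a simplified cell-reference pattern (one to three column letters, then a row number) stands in for the real grammar.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class MiniFormulaParser
{
    // Split a formula into alphanumeric tokens (names, cell references,
    // numbers) and single non-space characters (operators).
    static IEnumerable<string> Tokenize(string formula)
    {
        return Regex.Matches(formula, @"[A-Za-z0-9]+|\S")
            .Cast<Match>().Select(m => m.Value);
    }

    // A token is a cell reference only if the WHOLE token matches the
    // column-letters-then-row-number pattern; "ZZZZA3" fails this test
    // because it has more than three leading letters.
    static bool IsCellReference(string token)
    {
        return Regex.IsMatch(token, @"^[A-Z]{1,3}[0-9]+$");
    }

    public static List<string> ReferencedCells(string formula)
    {
        return Tokenize(formula.TrimStart('='))
            .Where(IsCellReference)
            .ToList();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(",", ReferencedCells("=A2+A3")));
        Console.WriteLine(string.Join(",", ReferencedCells("=A2+ ZZZZA3")));
    }
}
```

A real parser goes further, of course: tokens are consumed production by production, so a reference inside a string literal or a defined name never counts as a cell reference.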

Processing all Content Parts in an Open XML WordprocessingML Document

In Open XML WordprocessingML documents, there are five types of parts that can contain content such as paragraphs (with or without tracked revisions), tables, rows, cells, and any of a variety of content controls:

  • Main document part
  • Header parts (there can be more than one)
  • Footer parts (there can be more than one)
  • Endnotes (there can be zero or one)
  • Footnotes (there can be zero or one)

There are certain Open XML programming scenarios where you need to process all varieties of parts that contain content:

  • You need to search for specific words in a document, regardless of where those words occur.
  • You need to accept tracked changes anywhere they appear in the document.
  • You need to process content controls anywhere they occur in the document, perhaps to bind them to XML in a custom XML part.

The following example shows how to search for all content controls in a document, regardless of whether those content controls are in the main document part, in the headers/footers, or in endnotes/footnotes.  This example uses LINQ to XML.  If you are using the strongly-typed OM of the Open XML SDK, the code would be identical, except for the code to actually process the content controls.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;

public static class Extensions
{
    public static XDocument GetXDocument(this OpenXmlPart part)
    {
        XDocument partXDocument = part.Annotation<XDocument>();
        if (partXDocument != null)
            return partXDocument;
        using (Stream partStream = part.GetStream())
        using (XmlReader partXmlReader = XmlReader.Create(partStream))
            partXDocument = XDocument.Load(partXmlReader);
        part.AddAnnotation(partXDocument);
        return partXDocument;
    }
}

class Program
{
    private static void IterateContentControlsForPart(OpenXmlPart part)
    {
        XNamespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
        XDocument doc = part.GetXDocument();
        foreach (var sdt in doc.Descendants(w + "sdt"))
        {
            Console.WriteLine("Found content control");
            Console.WriteLine("=====================");
            Console.WriteLine(sdt.ToString());
            Console.WriteLine();
        }
    }

    public static void IterateContentControls(WordprocessingDocument doc)
    {
        IterateContentControlsForPart(doc.MainDocumentPart);
        foreach (var part in doc.MainDocumentPart.HeaderParts)
            IterateContentControlsForPart(part);
        foreach (var part in doc.MainDocumentPart.FooterParts)
            IterateContentControlsForPart(part);
        if (doc.MainDocumentPart.EndnotesPart != null)
            IterateContentControlsForPart(doc.MainDocumentPart.EndnotesPart);
        if (doc.MainDocumentPart.FootnotesPart != null)
            IterateContentControlsForPart(doc.MainDocumentPart.FootnotesPart);
    }

    static void Main(string[] args)
    {
        using (WordprocessingDocument doc = WordprocessingDocument.Open("Test.docx", false))
            IterateContentControls(doc);
    }
}

Recursive Descent Parser using LINQ: The Augmented Backus-Naur Form Grammar



A grammar is a device to define syntax for a language.  A grammar is made up of rules, sometimes called productions.  Each rule defines a symbol, which can then be used in other rules.  Grammars are not hard to understand; most developers instinctively understand grammars when they see them.  When you learn a new programming language, almost without thinking about it, you assemble some version of the grammar in your head.  One of the benefits of reading the grammar of a language is to make sure that the conceptual grammar you’ve mentally assembled matches the actual grammar of the language.

This post is one in a series on using LINQ to write a recursive-descent parser for SpreadsheetML formulas.  You can find the complete list of posts here.

Microsoft devotes a great deal of effort towards writing interoperability documents.  If you are a document format geek like me (or even if you only peripherally use document formats), you can find a treasure trove of information on MSDN under Microsoft Office File Format Documents.

The grammar that we want to use to parse SpreadsheetML formulas is in the interoperability document: Excel Extensions to the Office Open XML SpreadsheetML File Format (.xlsx) Specification.  This grammar is expressed in Augmented Backus–Naur Form (ABNF).

In this post, I’m going to distill ABNF down to just the set of rules and grammar syntax that we need to understand to write a parser for the grammar in the Excel Extensions Specification (linked above).  I’ll take all examples of ABNF grammar from that spec.

Terminals

Terminals express the actual text of the programming language.  A grammar expresses a terminal either as a quoted string, or as a range of values, such as the hex values for the digits “0” through “9”.

decimal-digit = %x30-39

Following is another symbol that uses a literal string terminal.

full-stop = "."

Just to be clear, this is the terminal:

decimal-digit = %x30-39
                ^^^^^^^

Where this is the grammar rule:

decimal-digit = %x30-39
^^^^^^^^^^^^^^^^^^^^^^^

The terminal in the full-stop rule is just the string literal:

full-stop = "."
            ^^^

Or

A symbol can consist of one symbol OR another.  In ABNF, “OR” is expressed as a forward slash.  The following rule defines constant to be any one of several varieties of constants.

constant = error-constant / logical-constant / numerical-constant / string-constant / array-constant

The logical-constant symbol is one of two terminals, expressed as quoted strings:

logical-constant = "FALSE" / "TRUE"

Adjacent Symbols

Two symbols separated by a space indicate that you must first have the one symbol, followed by the second symbol.  The following rule specifies that the fractional-part symbol requires a full-stop followed by a digit-sequence.

fractional-part = full-stop digit-sequence

Optional

A symbol or terminal that is optional is surrounded by square brackets.  The following definition of the exponent-part symbol indicates that the sign before the digit-sequence is optional.

exponent-part = exponent-character [ sign ] digit-sequence

The definition of exponent-character is of course:

exponent-character = "E"

The following examples could produce an exponent-part symbol:

E10
E+10
E-10
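An optional symbol maps naturally to “try to consume it; if it is not there, just continue.”  Here is a hand-rolled sketch of matching exponent-part (the class and method names are my own, for illustration only):

```csharp
using System;

public static class ExponentPart
{
    // exponent-part = exponent-character [ sign ] digit-sequence
    // Returns true if input, starting at position, begins with a valid
    // exponent-part.
    public static bool Matches(string input, int position)
    {
        int pos = position;
        if (pos >= input.Length || input[pos] != 'E')
            return false;                      // exponent-character is required
        pos++;
        if (pos < input.Length && (input[pos] == '+' || input[pos] == '-'))
            pos++;                             // [ sign ] is optional
        int digits = 0;
        while (pos < input.Length && char.IsDigit(input[pos]))
        { pos++; digits++; }
        return digits >= 1;                    // digit-sequence requires 1*
    }

    static void Main()
    {
        Console.WriteLine(Matches("E10", 0));  // True
        Console.WriteLine(Matches("E+10", 0)); // True
        Console.WriteLine(Matches("E", 0));    // False
    }
}
```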

Zero or More

If a symbol is preceded by an asterisk (*), zero or more of those symbols can occur at that point in the production of the symbol being defined.  The following rule says that an expression can be made up of a ref-expression, or zero or more instances of the whitespace symbol, followed by a nospace-expression, followed by zero or more instances of the whitespace symbol.

expression = ref-expression / *whitespace nospace-expression *whitespace

The symbol bring-to-front-params is defined to be an open parenthesis followed by zero or more space symbols, followed by a close parenthesis.

bring-to-front-params = "(" *space ")"

N or More

If a symbol is preceded by a number followed by an asterisk, it indicates that you must have at least n instances of that symbol.  The following defines a digit-sequence to be one or more decimal digits:

digit-sequence = 1*decimal-digit
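A “one or more” quantifier typically maps to a loop (or, using LINQ, a TakeWhile) that requires at least one match.  A minimal sketch, using names of my own invention rather than code from the spec:

```csharp
using System;
using System.Linq;

public static class DigitSequence
{
    // digit-sequence = 1*decimal-digit
    // Returns the number of characters consumed at position, or 0 if the
    // rule fails: 1* requires at least one decimal-digit.
    public static int Match(string input, int position) =>
        input.Skip(position).TakeWhile(char.IsDigit).Count();

    static void Main()
    {
        Console.WriteLine(Match("123abc", 0)); // 3
        Console.WriteLine(Match("abc", 0));    // 0 (no match)
    }
}
```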

Exactly N Symbols

If a symbol is preceded by a number, it indicates that you must have exactly n instances of that symbol.  The following defines escaped-double-quote to consist of exactly two adjacent double-quote symbols.

escaped-double-quote = 2double-quote

N to M Symbols

The following defines the and-params symbol to consist of an open parenthesis, followed by either a single argument-expression, or an argument followed by 1 to 254 comma/argument pairs, followed by a close parenthesis.

and-params = "(" (argument-expression / (argument 1*254("," argument))) ")"

Grouped Symbols

Symbols in a production can be grouped by parentheses, and then preceded by a symbol quantifier.  The following defines that the constant-list-rows symbol consists of one constant-list-row, followed by zero or more pairs of symbols, where the pair is a semicolon, followed by a constant-list-row.

constant-list-rows = constant-list-row *(semicolon constant-list-row)
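A grouped, quantified production like this maps to “parse one item, then loop: consume the separator and parse the item that must follow.”  A sketch of that shape (the ParseRow stand-in is hypothetical; a real parser would invoke the constant-list-row rule):

```csharp
using System;
using System.Collections.Generic;

public static class ConstantListRows
{
    // constant-list-rows = constant-list-row *(semicolon constant-list-row)
    public static List<string> Parse(string input)
    {
        var rows = new List<string>();
        int pos = 0;
        rows.Add(ParseRow(input, ref pos));     // the one required row
        while (pos < input.Length && input[pos] == ';')
        {
            pos++;                              // consume the semicolon
            rows.Add(ParseRow(input, ref pos)); // the row that must follow it
        }
        return rows;
    }

    // Stand-in for a real constant-list-row parser: read up to the next ';'.
    static string ParseRow(string input, ref int pos)
    {
        int start = pos;
        while (pos < input.Length && input[pos] != ';')
            pos++;
        return input.Substring(start, pos - start);
    }

    static void Main()
    {
        foreach (var row in Parse("1,2;3,4;5,6"))
            Console.WriteLine(row);
    }
}
```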

Exceptions and Special Rules

In some places, the grammar defines some special rules in text.  In our case, the following special rule is defined for an array-constant:

An array-constant MUST NOT contain:

  • An array-constant.
  • Columns or rows of unequal length.

In addition, following the grammar, there is additional text that describes further restrictions or exceptions.  As necessary, we’ll need to incorporate those restrictions.

You can see that grammar rules are not very complex.

My approach for coding the recursive descent parser will be to paste the grammar rule directly into the class that implements the rule as a C# comment.  This makes it very easy to correlate the grammar to the code that implements the rule.

In the next post, I’m going to define a super-small grammar that is a subset of the Excel formulas grammar.  Then in subsequent posts, we’ll implement and test a parser for that small grammar.

 

Recursive Descent Parser: A Simple Grammar



To learn how recursive descent parsers work, it is helpful to implement a very simple grammar, so for pedagogical purposes, I’ve defined a grammar for simple arithmetic expressions. The parser will construct a syntax tree from expressions that we can then examine as necessary. Just for fun, after implementing the parser, we will write a small method that will evaluate the formulas.

This post is one in a series on using LINQ to write a recursive-descent parser for SpreadsheetML formulas.  You can find the complete list of posts here.

In these expressions, operands are floating point numbers, but for simplicity, I’ve eliminated the ability to have exponents. Floating point numbers are the only variety of operand; to demonstrate how to write the parser, it’s not necessary to include the idea of variables or other types of operands. When writing the parser for Excel formulas, we’ll need to deal with both exponents and other types of operands, as well as many more varieties of issues.

There are five binary (sometimes called infix) operators. Operator precedence needs to be honored when parsing formulas:

Operator    Precedence
--------    ----------
+           1
-           1
*           2
/           2
^           3

There is one prefix operator, ‘-‘, which has higher precedence than the infix operators.

Operands can consist of a significand (sometimes called mantissa) part, followed by a fractional part.

The grammar allows for white space at appropriate places in the expression. The allowance of white space in this simple grammar parallels the allowance of white space in the Excel Extensions to the Office Open XML SpreadsheetML File Format (.xlsx) Specification.

The grammar allows for use of parentheses.

Here are some examples of formulas that should parse properly:

"(1+3)/3"
" (1+3) "
"-123"
"1+2*(-3)"
"1+2*( - 3)"
"12.34"
".34"
"-123+456"
"-(123+456)"
" ( 123 + 456 ) "
"1+2-3*4/5^6"
"-.34"
"-12.34"

An important characteristic of a parser is that it reports when an expression is invalid per the grammar. Here are some expressions that should throw exceptions:

"-(123+)"
"-(*123)"
"*123"
"123a"
"1."
"--1"

Here is the grammar, as I’ve defined it. There are only eleven rules (other than the rules that are comprised only of terminals):

formula = expression
expression = *whitespace nospace-expression *whitespace
nospace-expression = open-parenthesis expression close-parenthesis /
    expression infix-operator expression / numerical-constant / prefix-operator expression
numerical-constant = [neg-sign] significand-part
significand-part = whole-number-part [fractional-part] / fractional-part
whole-number-part = digit-sequence
fractional-part = full-stop digit-sequence
neg-sign = minus
digit-sequence = 1*decimal-digit
prefix-operator = plus / minus
infix-operator = caret / asterisk / forward-slash / plus / minus

// The following symbols are comprised only of terminals.
decimal-digit = %x30-39
whitespace = %x20
plus = "+"
minus = "-"
asterisk = "*"
forward-slash = "/"
caret = "^"
full-stop = "."
open-parenthesis = "("
close-parenthesis = ")"
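To preview the general technique (this is my own hand-rolled sketch for a tiny subset of the grammar above — digit sequences plus the "+" and "*" infix operators — not the LINQ-based implementation developed later in this series), note how precedence falls out of having one method per precedence level, each calling the next-higher level:

```csharp
using System;

public class MiniParser
{
    string s;
    int pos;

    public static double Evaluate(string formula) =>
        new MiniParser { s = formula }.ParseSum();

    // sum = product *("+" product)   -- lowest precedence
    double ParseSum()
    {
        double left = ParseProduct();
        while (pos < s.Length && s[pos] == '+')
        {
            pos++;
            left += ParseProduct();
        }
        return left;
    }

    // product = number *("*" number) -- binds tighter than "+"
    double ParseProduct()
    {
        double left = ParseNumber();
        while (pos < s.Length && s[pos] == '*')
        {
            pos++;
            left *= ParseNumber();
        }
        return left;
    }

    // number = 1*decimal-digit
    double ParseNumber()
    {
        int start = pos;
        while (pos < s.Length && char.IsDigit(s[pos]))
            pos++;
        if (pos == start)
            throw new Exception("Expected a digit at position " + pos);
        return double.Parse(s.Substring(start, pos - start));
    }

    static void Main()
    {
        Console.WriteLine(Evaluate("1+2*3")); // 7
    }
}
```

Because ParseSum calls ParseProduct for its operands, "1+2*3" evaluates to 7 rather than 9 — the multiplication is grouped first, with no explicit precedence table needed.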

In the next post in this series, I’ll start discussing some of the C# / LINQ techniques that we can use to make coding this parser super-easy.
