Lets Learn

Opinion Matters

Archive for the ‘Spreadsheet’ Category

How to Delete a Column in Excel using Open XML SDK

Posted by Ankush on June 13, 2011

Recently there was a question posted on my blog as how we can delete a column in spreadsheet (Excel) using Open XML SDK. Actually, it’s a very tricky requirement because you need to update so much of information i.e. changing the cell index. So it’s really depends how you want to achieve it. You may want to write a COM Addin/VSTO addin to do the job but then you need to deploy the add-in at client side. So, if you are looking for a server side solution, here is the approach. You may want to optimize it according to your need but I will explain in detail what/why I did.

So, let’s start with the concept. Let’s say have created a sample Excel workbook like this:

clip_image001

If you look the XML representation of the data, it will be like this

<x:sheetData xmlns:x="http://schemas.openxmlformats.org/spreadsheetml/2006/main">

<x:row r="1" spans="1:3" x14ac:dyDescent="0.25" xmlns:x14ac="http://schemas.microsoft.com/office/spreadsheetml/2009/9/ac">
<x:c r="A1">
<x:v>1</x:v>
</x:c>
<x:c r="B1">
<x:v>2</x:v>
</x:c>
<x:c r="C1">
<x:v>3</x:v>
</x:c>
</x:row>

<x:row r="2" spans="1:3" x14ac:dyDescent="0.25" xmlns:x14ac="http://schemas.microsoft.com/office/spreadsheetml/2009/9/ac">
<x:c r="A2">
<x:v>1</x:v>
</x:c>
<x:c r="B2">
<x:v>2</x:v>
</x:c>
<x:c r="C2">
<x:v>3</x:v>
</x:c>
</x:row>
</x:sheetData>

So pay attention to how rows and cells are declared. So the declaration is:

Row1
Cell1
Cell2
Cell3

Row2
Cell1
Cell2
Cell3

So for ex: if you delete cell2 from all the rows, you need to update the rowIndex (x:r) in the remaining cells. Let’s say you delete column B , the updated XML should look like this:

<x:row r="1" spans="1:3" x14ac:dyDescent="0.25" xmlns:x14ac="http://schemas.microsoft.com/office/spreadsheetml/2009/9/ac">
<x:c r="A1">
<x:v>1</x:v>
</x:c>
<x:c r="B1">
<x:v>3</x:v>
</x:c>
</x:row>

<x:row r="2" spans="1:3" x14ac:dyDescent="0.25" xmlns:x14ac="http://schemas.microsoft.com/office/spreadsheetml/2009/9/ac">
<x:c r="A2">
<x:v>1</x:v>
</x:c>
<x:c r="B2">
<x:v>3</x:v>
</x:c>
</x:row>

And here is the C# code:

using System;

using System.Collections.Generic;

using System.Linq;

using System.Text;

using DocumentFormat.OpenXml;

using DocumentFormat.OpenXml.Packaging;

using DocumentFormat.OpenXml.Spreadsheet;

using System.Xml.Linq;

 

namespace DeleteSpreadSheetColumn

{

    class Program

    {

        static XDocument columnsXL;

        static void Main(string[] args)

        {

 

            LoadColumnName("");

            string result = DeleteAColumn(@"C:\Users\abhatia\Desktop\Test1.xlsx", "Sheet1", "I");

        }

 

        public static void LoadColumnName(string XMLFile)

        {

            columnsXL = XDocument.Load(@"C:\Users\abhatia\Desktop\ExcelColumn.xml");

 

        }

        // Given a document, a worksheet name, a column name

        // deletes a column from a specified worksheet.

        public static string DeleteAColumn(string docName, string sheetName, string colName)

        {

            // Open the document for editing.

            using (SpreadsheetDocument document = SpreadsheetDocument.Open(docName, true))

            {

                IEnumerable<Sheet> sheets = document.WorkbookPart.Workbook.GetFirstChild<Sheets>().Elements<Sheet>().Where(s => s.Name == sheetName);

                if (sheets.Count() == 0)

                {

                    // The specified worksheet does not exist.

                    return sheetName + "doesn't exist";

                }

                string relationshipId = sheets.First().Id.Value;

                WorksheetPart worksheetPart = (WorksheetPart)document.WorkbookPart.GetPartById(relationshipId);

 

 

                // Get the Total Rows

                IEnumerable<Row> rows = worksheetPart.Worksheet.GetFirstChild<SheetData>().Elements<Row>();

                if (rows.Count() == 0)

                {

                    return "Rows doesn't exist";

 

                }

 

                // Loop through the rows and adjust Cell Index

                foreach (Row row in rows)

                {

                    int index = (int)row.RowIndex.Value;

                    IEnumerable<Cell> cells = row.Elements<Cell>();

 

                    IEnumerable<Cell> cellToDelete = cells.Where(c => string.Compare(c.CellReference.Value, colName + index, true) == 0);

                    if (cellToDelete.Count() > 0)

                        cellToDelete.First().Remove();

 

                    int intColumnToBeDel = (int)(from c in columnsXL.Descendants("col") where c.Attribute("name").Value == colName select c.NodesBeforeSelf().Count()).FirstOrDefault() + 1;

 

 

                    foreach (Cell cell in cells)

                    {

 

                        string col = GetColunName(cell.CellReference.Value);

                        int intColumnPres = (int)(from c in columnsXL.Descendants("col") where c.Attribute("name").Value == col select c.NodesBeforeSelf().Count()).FirstOrDefault() + 1;

 

                        if (intColumnPres > intColumnToBeDel)

                        {

 

                            string ColNamefromXML = ((System.Xml.Linq.XElement)(from c in columnsXL.Descendants("col")

                                                                                where c.Attribute("name").Value + index == cell.CellReference.Value

                                                                                select c.PreviousNode).FirstOrDefault()).FirstAttribute.Value;

                            cell.CellReference.Value = ColNamefromXML + index;

 

                        }

                    }

                }

 

                worksheetPart.Worksheet.Save();

                return "Deleted";

 

            }

        }

 

 

 

        public static bool isNumeric(string val, System.Globalization.NumberStyles NumberStyle)

        {

            Double result;

            return Double.TryParse(val, NumberStyle,

                System.Globalization.CultureInfo.CurrentCulture, out result);

        }

 

        private static string GetColunName(string p)

        {

            Boolean breakLoop = true;

            while (breakLoop)

            {

                if (isNumeric(p.Last().ToString(), System.Globalization.NumberStyles.Integer))

                    p = p.Substring(0, p.Length - 1);

                else

                    breakLoop = false;

            }

            return p;

        }

    }

}

 

Please note that ExcelColumn.xml is just an XML file which contains all the column name. I find it an easy way to get the column name. You can download the complete project and XML file from http://cid-956eb159f7975780.skydrive.live.com/redir.aspx?resid=956EB159F7975780!197

Feel free to optimize it according to your need and let me know if you found any problem with this one!!!

Posted in Open XML SDK, Spreadsheet | Tagged: , , , | 3 Comments »

How to Insert a Hyperlink in Excel using Open XML SDK 2.0

Posted by Ankush on March 4, 2011

Consider a scenario, you want to insert a hyperlink in Excel 2010 (or Excel 2007) file programmatically using a web application. You should not use Office Object Model in a web application, Please refer http://support.microsoft.com/kb/257757. So what we can do??

As we know that Office 2007 and 2010 file format is XML based (just rename any Office 2007/2010 document to .zip and you can see a lots of XML files which makes a document and they are simply zipped in a one file). To manipulate these files, MS has introduced a library, Open XML SDK 2.0. The SDK is built on the System.IO.Packaging API and provides strongly-typed classes to manipulate documents that adhere to the Office Open XML File Formats Specification. The Office Open XML File Formats specification is an open, international, ECMA-376, Second Edition and ISO/IEC 29500 standard. The Open XML file formats are useful for developers because they are an open standard and are based on well-known technologies: ZIP and XML.

The Open XML SDK 2.0 simplifies the task of manipulating Open XML packages and the underlying Open XML schema elements within a package. The Open XML SDK 2.0 encapsulates many common tasks that developers perform on Open XML packages, so that you can perform complex operations with just a few lines of code. Some benefits if you use Open XML SDK:

Open Format: These formats use XML and ZIP, and they are fully documented. Anyone can get the full specs on the formats and it is a royalty free license for anyone that wants to work with the files.

Compressed: Files saved in these new XML formats are less than 50% the size of the equivalent file saved in the binary formats. This is because we take all of the XML parts that make up any given file, and then we ZIP them. We chose ZIP because it’s already widely in use today and we wanted these files to be easy to work with.

Robust: Between the usage of XML, ZIP, and good documentation the files get a lot more robust. By compartmentalizing our files into multiple parts within the ZIP, it becomes a lot less likely that an entire file will be corrupted (instead of just individual parts). The files are also a lot easier to work with, so it’s less likely that people working on the files outside of Office will cause corruptions.

Backward compatible: You don’t have to use the new version of Office to take advantage of these formats. You can use Office Compatibility pack to open the Office 2007 document in older version.

Binary Format support: You can still use the current binary formats with the new version of Office. In fact, people can easily change to use the binary formats as the default if that’s what they’d rather do.

New Extensions: The new formats has new extensions (.docx, .pptx, .xlsx) so you can tell what format the files you are dealing with are, but to the average end user they’ll still just behave like any other Office file. Double click & it opens in the right application.

Is also important to know that what SDK CAN’T do:

The Open XML SDK 2.0:

  • Does not replace the Microsoft Office Object Model and provides no abstraction on top of the file formats. You must still understand the structure of the file formats to use the Open XML SDK 2.0.
  • Does not provide functionality to convert Open XML formats to and from other formats, such as HTML or XPS.
  • Does not guarantee document validity of Open XML Formats when you use the Open XML SDK 2.0 or if you decide to manipulate the underlying XML directly.
  • Does not provide application behavior such as layout functionality in Word or recalculation, data refresh, or adjustment functionalities in Excel.

So lets have a look at the solution. I will try to build the solution from the scratch so that you can learn how you can use Open XML tools and leverage this information in other exercise as well.

So lets start!!

Step 1: Download the Open XML SDK and the tool from http://www.microsoft.com/downloads/en/details.aspx?FamilyId=C6E744E5-36E9-45F5-8D8C-331DF206E0D0&displaylang=en. Please note that there are 2 links, one is for the tool and the second one is the actual library.

Step 2: So the objective is to insert a link into a cell. For the simplicity, we will assume that we want to insert the link into cell A1 of Sheet1. Lets create an Excel 2010 workbook and place a hyperlink in cell A1 manually. Once done, Excel file will look like this

clip_image001

Save this file and close it.

Step 3: If you haven’t changed the default installation location of the tool, they will appear here: C:\Program Files (x86)\Open XML SDK\V2.0\tool. Go ahead and run OpenXmlSdkTool.exe from here.

Step 4: You can simply open Office file in this tool and look at the different XML parts. (To understand different XML parts, have a look:https://ankushbhatia.wordpress.com/2010/03/04/basics-of-spreadsheet/). So go ahead and open the file in this tool. Once done, it will look like this:

clip_image002

Step 5: We know that we inserted the hyperlink in cell A1 of Sheet1 so go ahead and browse the Sheet1.xml file. Once you select the Sheet1, it gives you different child elements of it. Please select hyperlinks and click on “reflect Code” to view the XML. Make sure to follow the correct tree structure to get to Sheet1.xml and hyperlink object. After this, it will appear as:

clip_image003

Pay attention to the right side. There are 2 windows. Upper window shows XML of the section you have selected in the “Document Explorer” and the bottom window shows the C# code to generate the XML..Kind of cool….isn’t?? So now what we have..? We have XML which is required to generate a hyperlink and the equivalent C# code. Lets have a closure look at the XML

<x:hyperlinks xmlns:x=”http://schemas.openxmlformats.org/spreadsheetml/2006/main”&gt;

<x:hyperlink ref=”A1″ r:id=”rId1″ xmlns:r=”http://schemas.openxmlformats.org/officeDocument/2006/relationships&#8221; />

</x:hyperlinks>

There is no information about the hyperlink text. If you look at the relationship ID “A1″, it appears as

<?xml version=”1.0″ encoding=”UTF-8″ standalone=”true”?>

<Relationships xmlns=”http://schemas.openxmlformats.org/package/2006/relationships“><Relationship TargetMode=”External” Target=”http://www.microsoft.com/” Type=”http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink” Id=”rId1″/></Relationships>

So where is the actual text (remember I entered Microsoft)?

As you know that Excel keeps all the text strings in a separate XML file, SharedSTring.xml and refer it back based on the index. So lets see how the data appear in cell A1

Have a look at the SheetData [where it keeps all the data . It is specific to a sheet] in the tool

clip_image004

If you look at the XML, it contains the information about the row which contains the data.

<x:sheetData xmlns:x=”http://schemas.openxmlformats.org/spreadsheetml/2006/main”&gt;

<x:row r=”1″ spans=”1:1″ x14ac:dyDescent=”0.25″ xmlns:x14ac=”http://schemas.microsoft.com/office/spreadsheetml/2009/9/ac”&gt;

<x:c r=”A1″ s=”1″ t=”s”>

<x:v>0</x:v>

</x:c>

</x:row>

</x:sheetData>

Lets try to decode it.

<x:row r=”1″ spans=”1:1″ x14ac:dyDescent=”0.25″ xmlns:x14ac=”http://schemas.microsoft.com/office/spreadsheetml/2009/9/ac”&gt; :

r=1 means there is one row.

There is one child element <x:c r=”A1″ s=”1″ t=”s”>.

X:c -> Column
r=A1 -> Referenced cell is “A1″

S=1 and t=”S” it’s a string value and value is referenced from Shared String table. And S=1 means that a style is applied and the defination of the style is taken from index 1 which can be found from Style.xml

Now analyze the SharedSTring.XML using the same tool

clip_image005

Here you will see the Microsoft string. It is referred back in SheetData using x:v node. The starting index is 0.

So to summarize, we need to do:

  1. Create a hyperlink object and insert it before the PageMargin node. Please note that Excel is enforcing a strict ordering of the elements. So make sure to add it before PageMargines node.
  2. Create an entry in the SharedString Table
  3. Create an entry in the SheetData section

So why we did this exercise. Every now and then we face situation where we want to these kind of things in Office using Open XML. So easy thing is, do the stuff in UI and analyze the XML and C# code and then write your own.

Lets move on and utilize this information.

Step 6: Create a new Excel file and save it. We will insert a hyperlink in this file. Create a new VS 2010 project and open the new blank file.

Step 7: Add a reference to DocumentFormat.OpenXML and WindowsBase.dll from .NET tab

Step 8: Add a namespace using XL = DocumentFormat.OpenXml.Spreadsheet;

Step 9: So basically you need to

a. Open the workbook
b. Get to the sheet where you want to add the hyperlink
c. Add the hyperlink
d. Add an entry in SheetData
e. Add an entry in SharedString.xml

The below code does exactly that.

using (SpreadsheetDocument document = SpreadsheetDocument.Open(@"C:\Users\abhatia\Desktop\Sample1.xlsx", true))

           {

               IEnumerable<Sheet> sheets = document.WorkbookPart.Workbook.GetFirstChild<Sheets>().Elements<Sheet>().Where(s => s.Name == "Sheet1");

               if (sheets.Count() == 0)

               {

                   // The specified worksheet does not exist.

                   return;

               }

               string relationshipId = sheets.First().Id.Value;

               WorksheetPart worksheetPart = (WorksheetPart)document.WorkbookPart.GetPartById(relationshipId);


               Hyperlinks hyperlinks1 = new Hyperlinks();

               Hyperlink hyperlink1 = new Hyperlink() { Reference = "A1", Id = "UNIQUE" };

               hyperlinks1.Append(hyperlink1);


               PageMargins pageMargins = worksheetPart.Worksheet.Descendants<PageMargins>().First();

               worksheetPart.Worksheet.InsertBefore<Hyperlinks>(hyperlinks1, pageMargins);

               worksheetPart.Worksheet.Save();

               worksheetPart.AddHyperlinkRelationship(new System.Uri("http://www.google.com/", System.UriKind.Absolute), true, "UNIQUE");


               // Create an element in SheetData


               // Get the SharedStringTablePart. If it does not exist, create a new one.

               SharedStringTablePart shareStringPart;

               if (document.WorkbookPart.GetPartsOfType<SharedStringTablePart>().Count() > 0)

               {

                   shareStringPart = document.WorkbookPart.GetPartsOfType<SharedStringTablePart>().First();

               }

               else

               {

                   shareStringPart = document.WorkbookPart.AddNewPart<SharedStringTablePart>();

               }



               // Insert the text into the SharedStringTablePart.

               int index = InsertSharedStringItem("www.google.com", shareStringPart);


               // Insert cell A1 into the new worksheet.

               Cell cell = InsertCellInWorksheet("A", 1, worksheetPart);


               // Set the value of cell A1.

               cell.CellValue = new CellValue(index.ToString());

               cell.DataType = new EnumValue<CellValues>(CellValues.SharedString);


               // Save the new worksheet.

               worksheetPart.Worksheet.Save();



           }

 

Here are the helper functions (taken from MS site):

// Given text and a SharedStringTablePart, creates a SharedStringItem with the specified text 

// and inserts it into the SharedStringTablePart. If the item already exists, returns its index.

private static int InsertSharedStringItem(string text, SharedStringTablePart shareStringPart)

{

    // If the part does not contain a SharedStringTable, create one.

    if (shareStringPart.SharedStringTable == null)

    {

        shareStringPart.SharedStringTable = new SharedStringTable();

    }


    int i = 0;


    // Iterate through all the items in the SharedStringTable. If the text already exists, return its index.

    foreach (SharedStringItem item in shareStringPart.SharedStringTable.Elements<SharedStringItem>())

    {

        if (item.InnerText == text)

        {

            return i;

        }


        i++;

    }


    // The text does not exist in the part. Create the SharedStringItem and return its index.

    shareStringPart.SharedStringTable.AppendChild(new SharedStringItem(new DocumentFormat.OpenXml.Spreadsheet.Text(text)));

    shareStringPart.SharedStringTable.Save();


    return i;

}

 

// Given a column name, a row index, and a WorksheetPart, inserts a cell into the worksheet. 

       // If the cell already exists, returns it. 

       private static Cell InsertCellInWorksheet(string columnName, uint rowIndex, WorksheetPart worksheetPart)

       {

           Worksheet worksheet = worksheetPart.Worksheet;

           SheetData sheetData = worksheet.GetFirstChild<SheetData>();

           string cellReference = columnName + rowIndex;


           // If the worksheet does not contain a row with the specified row index, insert one.

           Row row;

           if (sheetData.Elements<Row>().Where(r => r.RowIndex == rowIndex).Count() != 0)

           {

               row = sheetData.Elements<Row>().Where(r => r.RowIndex == rowIndex).First();

           }

           else

           {

               row = new Row() { RowIndex = rowIndex };

               sheetData.Append(row);

           }


           // If there is not a cell with the specified column name, insert one.  

           if (row.Elements<Cell>().Where(c => c.CellReference.Value == columnName + rowIndex).Count() > 0)

           {

               return row.Elements<Cell>().Where(c => c.CellReference.Value == cellReference).First();

           }

           else

           {

               // Cells must be in sequential order according to CellReference. Determine where to insert the new cell.

               Cell refCell = null;

               foreach (Cell cell in row.Elements<Cell>())

               {

                   if (string.Compare(cell.CellReference.Value, cellReference, true) > 0)

                   {

                       refCell = cell;

                       break;

                   }

               }


               Cell newCell = new Cell() { CellReference = cellReference };

               row.InsertBefore(newCell, refCell);


               worksheet.Save();

               return newCell;

           }

       }

 

Assumptions:

I have made several assumptions while writing code for this.

  1. I have assumed that Excel file is a new blank file i.e. there is no hyperlink present.
  2. Sheet1 name is hardcoded.
  3. File Name is Hardcoded.
  4. Can you think of anything by looking at the code? 🙂

Run this code and everything will work. When you will open the Excel sheet and click on the link, it will be treated as hyperlink.

But still one thing is missing. It is not underlined and appear in the blue color. Because you need to apply a style to do that. I leave this exercise to my readers to check how the styles are applied and how they can use this information. In case of anything, let me know 🙂

Please try this approch in every situation where you use Open XML and let me know in case of any questions.

Posted in Open XML SDK, Spreadsheet | Tagged: , , , | 9 Comments »

Basics of Spreadsheet – How to Clear a Column – Part 3

Posted by Ankush on March 11, 2010

Now we understand the basics of Spreadsheet, lets discuss about a simple scenario. Consider this, I have an Excel sheet and I would like to clear a column. To understand what’s required to accomplish this, lets follow:

 Create a sample worksheet.

Extract the content and view Sheet1.xml / SharedString.XML 

Sheetdata( Sheet1.xml) SharedString.XML
<sheetData>
<row r=”1″ spans=”1:2″>
  <c r=”A1″ t=”s”>
        <v>0</v>
  </c>
  <c r=”B1″>
       <v>9</v>
  </c>
</row><row r=”2″ spans=”1:2″>
 <c r=”A2″ t=”s”>
   <v>1</v>
 </c>
 <c r=”B2″>
  <v>10</v>
 </c>
</row>

<row r=”3″ spans=”1:2″>
  <c r=”A3″>
   <v>1</v>
  </c>
 <c r=”B3″>
   <v>11</v>
 </c>
</row>

<row r=”4″ spans=”1:2″>
  <c r=”A4″>
   <v>2</v>
  </c>
 <c r=”B4″>
   <v>12</v>
 </c>
</row>

<row r=”5″ spans=”1:2″>
  <c r=”A5″>
   <v>3</v>
  </c>
 <c r=”B5″>
   <v>13</v>
 </c>
</row>

<row r=”6″ spans=”1:2″>
  <c r=”A6″>
   <v>4</v>
  </c>
 <c r=”B6″>
   <v>14</v>
 </c>
</row>

<row r=”7″ spans=”1:2″>
  <c r=”A7″>
   <v>5</v>
  </c>
 <c r=”B7″>
   <v>15</v>
 </c>
</row>
</sheetData>

<?xml version=”1.0″ encoding=”UTF-8″ standalone=”yes”?>
<sst xmlns=”http://schemas.openxmlformats.org/spreadsheetml/2006/main&#8221; count=”2″ uniqueCount=”2″>
  <si>
    <t>John</t>
  </si>
  <si>
    <t>Tom</t>
  </si>
</sst>

 

As I have explained in my previous blog that strings are referenced based on index in Sheet1.xml and the actual values are present in the sharedString XML that’s why you don’t see any string stored in SheetData. Now if you look  at the SheetData, you will notice that it is defined row by row basis. So basically If I want to clear a column, I have to delete an XML node for each row element. Let’s just examine one row entry in detail:

 <row r=”1″ spans=”1:2″>
      <c r=”A1″ t=”s”>
        <v>0</v>
      </c>
      <c r=”B1″>
        <v>9</v>
      </c>
    </row> 

If you notice that to refer a column, Excel uses A1 and B1 and every row is defined in that way. Let’s say, you want to clear column A, then you need to go through the every row, find c node (basically cell) which refers to A column and delete that node. Basically

Previous XML After XML
    <row r=”1″ spans=”1:2″>
      <c r=”A1″ t=”s”>
        <v>0</v>
      </c>
      <c r=”B1″>
        <v>9</v>
      </c>
    </row>
    <row r=”2″ spans=”1:2″>
      <c r=”A2″ t=”s”>
        <v>1</v>
      </c>
      <c r=”B2″>
        <v>10</v>
      </c>
    </row>
    <row r=”1″ spans=”1:2″>
      <c r=”A1″>
        <v>9</v>
      </c>
    </row>
    <row r=”2″ spans=”1:2″>
      <c r=”A2″>
        <v>10</v>
      </c>
    </row>

 

So now you understand the concept, let’s have a look at the sample code which clears a column A 

// Clear a column from XL

    string docName=””,  worksheetName=”Sheet1″;
    SpreadsheetDocument document = SpreadsheetDocument.Open(@”Book1.xlsx”, true);
    IEnumerable<Sheet> sheets = document.WorkbookPart.Workbook.Descendants<Sheet>().Where(s => s.Name == worksheetName);
    if (sheets.Count() == 0)
    {
        // The specified worksheet does not exist.
        return;
    }
    WorksheetPart worksheetPart = (WorksheetPart)document.WorkbookPart.GetPartById(sheets.First().Id);
    Worksheet worksheet = worksheetPart.Worksheet;
    SheetData sheetData = worksheet.GetFirstChild<SheetData>();
    var cells = sheetData.Descendants<Cell>().Where(s => s.CellReference.Value.Contains(“A”));
    foreach (var item in cells)
    {
        item.Remove();
    }
    worksheet.Save();

But are we done yet????..Still there are some strings left in SharedString which were referenced by column A. For the completeness you can delete them as well but even if you don’t delete them, Excel will do it for you. When you will open the modified sheet in Excel and save it, it will readjust the data for you!!!!!!

Let me know wht you think about this Post!!!!!!

Posted in Open XML SDK, Spreadsheet | 6 Comments »

Basics Of SpreadSheet Part-2

Posted by Ankush on March 8, 2010

This is a continuation series to discuss about SpreadsheetML. Today I am going to discuss about how worksheet stores values, formulas and different technique it uses.

Worksheetx.xml
==================

The worksheet (basically Worksheetx.xml file) is the main component in SpreadsheetML. We already know that SheetData is the section which contain the actual values. To create new values for the worksheet you define rows inside the sheetData element. These rows contain cells ( c element), which contain values (v element). For ex:

<worksheet xmlns=”http://…/spreadsheetml/2006/main&#8221; > <sheetData> <row> <c> <v>42</v> </c> </row> </sheetData> </worksheet>

Here  is great optimization. You see only those rows/cells in which value is present. This can be very handy while reading the values. Basically if you use UI, you need to make sure to read only those cells which contains value. Using Open XML can be very easy in those scenarios where you want to read specific section on workbook.

 There are different techniques to store string value/Formulas in spreadsheet. Let’s have a look at the different techniques in detail:

In-Line String:
===================

This element allows for strings to be expressed directly in the cell definition instead of implementing the shared string table (explained in the same blog).To create a cell with text you use the ‘is’ element, which stands for inline-string.

<c r=”A1″><is><t>String</t> </is></c> 

Sparse table markup
=======================

As you might have noticed that up until now the row and cell position starts from the left top of the spreadsheet and runs outward, the first cell being cell A1. Using this model it takes quite a lot of markup if you only want to put a value in cell. E5. E is the fifth column, 5 the fifth row, that’s 25 cells to just get to cell E5. That is obviously not the most optimal thing to do with regard to the amount of markup and parsing to do. To solve this problem, a row and cell can include a position identifier. To for instance create a spreadsheet with a single row positioned at index 5, you add the r attribute to the row element as seen in markup sample 76. Cells use a similar model to identify their position. The r attribute is to identify the column and row index. Obviously the row index should be equal to the one defined at the row level.

<worksheet xmlns=”http://schemas.openxmlformats.org/spreadsheetml/2006/main”&gt;
<sheetData><row r=”5″><c r=”E5″><v>1234</v></c></row></sheetData></worksheet>

 Also note that you cannot specify row 5 before row 4 and column E before D. Doing so is considered an error in the spreadsheet.

Shared String
==================

If you create a spreadsheet with Microsoft Excel you will notice that cell text differs from the inline strings example you have just seen. Shared strings optimize space requirements when the spreadsheet contains multiple instances of the same string. If you stored these repeated strings using inline string markup, the same markup would appear over and over in the worksheet. There are several downsides to this approach. First of all the file will take more space on disk because of all the redundancy of content. Moreover, loading and saving will also take longer.To optimize the use of strings in a spreadsheet SpreadsheetML takes the approach of storing a single instance of the string in a table, called the shared strings table. The cells then reference the string by index instead of storing the value inline in the cell value.

The shared strings table appears as a separate part inside the package. Each workbook contains only one shared strings part whether the strings appear multiple times in one sheet or in multiple sheets. As I mention that there can only be one shared string table in the spreadsheet. Since each worksheet uses the same, it is referenced from the workbook instead of from each individual worksheet.  The shared string table is defined using the sst element. It uses two attributes to keep track of the content and usage of shared strings. The si element is used to add an item to the shared string. It stands for string-item and is the exact opposite of the inline string element is.

<sst xmlns=”http://schemas.openxmlformats.org/spreadsheetml/2006/main”&gt;
<si><t>Region</t></si><si><t>Sales Person</t></si><si><t>Sales Quota</t></si></sst>

The cells used the type quantifier t to identify them as inline string containers. Now they need to use shared strings instead. The type needs to be set to ‘shared string’, identified using the s value. The value of the cell is used to store the index into the shared string table. A good example is the header row of the sample spreadsheet. It uses the following markup to reference the string table.

<worksheet xmlns=”http://schemas.openxmlformats.org/spreadsheetml/2006/main”><sheetData&gt; <row r=”1″ spans=”1:4″><c r=”A1″ t=”s”><v>0</v></c><c r=”B1″ t=”s”><v>1</v></c><c r=”C1″ t=”s”><v>2</v></c><c r=”D1″ t=”s”><v>3</v></c></row></sheetData></worksheet>

 Formulas
===================

Formulas are stored inside each cell that uses a formula. Using the f element you define the formula text. Formulas can contain mathematical expressions that include a wide range of predefined functions.

<c><f>(A3*B3) * (1-C3)</f> <v>3168</v> </c>

 The v element, which previously contained the direct value, is now used to store the cached formula value based on the last time the formula was calculated. This allows the consumer to postpone calculation of the formula values when the spreadsheet is opened, which would result in longer waits when opening a worksheet. You do not need to specify the value, and if you omit it, it is the responsibility of the Open XML reader to compute the value based on the formula definition. The use of functions in a highly internationalized environment such as the document workspace brings interesting questions. When you run Microsoft Office Excel you are allowed to express functions in the native language. The ‘SUM()’ function in Spanish would become ‘SUMA()’. The important bit to consider is the way in which this is stored in Open XML. The function will be saved in its non-localized form. When a Spanish user would use the SUMA() function from within the consumer, the underlying markup would store the SUM() function. Only the user interface is localized, not the markup. This saves you a great deal of program code when working with functions, and let the developers benefit of working with well-known functions. Formulas cannot only appear inside a worksheet cell. Other elements such as table columns are also capable of storing formulas. The same model is used through-out. The formula is always stored as text.

Shared formulas
=============================

Similar to shared strings there is also an optimization for formulas. Like string values, formulas are often repeated over a range of cells, for instance in a calculated column of a table. They only differ in small details such as the row index the formula uses. For the consumer it is easier to load the first formula into memory and base the repeating formulas on that data than it is to load each formula individually. Unlike the optimization to strings using a separate shared-string part, the shared formula model does not rely on a separate part inside the package. It all takes place right within the worksheet data. A normal formula was defined by just adding the formula element f inside the cell. The formula text was stored within the opening and closing tags of the f element. For shared formulas the formula element has an extra attribute applied to identify it as being a shared formula, similar to identifying the cell type. The first cell using the formula stores three values. It stores the formula text like normal and a reference to the cell range using that formula type. The formula itself is identified using the si attribute. The next cell using that formula uses the same value for this attribute. It does not define the formula text, only the first cell has to store that information.

<row> <c> <v>1</v> </c> <c> <f t=”shared” ref=”C2:C4″ si=”0″>A2 + 1</f> <v>2</v> </c> </row> <row> <c> <v>2</v> </c> <c> <f t=”shared” si=”0″ /> <v>3</v> </c> </row>

Similarly you can create different elements in Excel using UI and then analyze the XML to see how it is defined and learn more….

Now we understand the basics of SpreadSheet, lets discuss about some basics operations in Excel using Open XML:

  • How to format cell as datetime
  • How to format number as number
  • How to add background color
  • How to add a formula
  • How to Clear a column

In my upcomming blogs, I will discuss about these tasks. If you have anything, which you would like me to have a createa blog upon, just let me know.

Here is previous part:  Basics of Spreadsheet – Part 1

Posted in Open XML SDK, Spreadsheet | 3 Comments »

Basics of Spreadsheet – Part 1

Posted by Ankush on March 4, 2010

I have been working with Open XML closely and thought of to write a blog which provides an Introduction to Open XML format. Office Open XML is a ZIP-based file format developed by Microsoft for representing spreadsheets, charts, presentations and word processing documents. By default, documents created using 2007 Microsoft Office products is based on new, XML-based file formats. Before getting into the discussion, let’s understand the basic terminology.

Package: The ZIP container that holds the components (parts) that comprise the document, as defined by the Open Packaging Conventions specification.

Part: Corresponds to one file in the package. For example, if a user right-clicks an Excel 2007 file and chooses to extract it, you will see files like workbook.xml file and several sheetn.xml files. Each of those files is a part in the package.

Relationships: The method used to specify how the collection of parts are related. This method specifies the connection between a source part and a target resource. Relationships are stored within XML parts (for example, /_rels/.rels) in the document package.

XML: Extensible Markup Language (XML) is a simple, flexible text format designed for electronic publishing and the exchange of a wide variety of data on the Internet and elsewhere.

ZIP: An industry-standard compression archives format used to store and transport files between computers, through e-mail, or over the Internet.

Now we understand this, lets discuss about SpreadsheetML. Let’s say you have created an Excel workbook using excel 2007 which contains some values in cell. You rename the Excel file to .zip and extract the content of it, this is the structure you will see:

_rels folder: This folder contains a .rels file that defines the root relationships within the package.

.rels file: Contains relationships based on the start part (the virtual start part). Relationships are defined with the following format:

 <Relationship Id=”someID” Type=”relationshipType” Target=”targetPart”/>

where Id is a unique string in the .rels file.

Type: The type of the relationship that distinguishes relationships from one another and provides a hint as to the relationship’s purpose. It points to the schema that defines Office XML Formats types.

Target: Points to the folder and file that contain the target of the relationship (another part).

For example: A sheet part is defined as

 <Relationship Id=”rId3″ Type=”http://schemas.openxmlformats.org/officeDocument/2006/relationships/worksheet” Target=”worksheets/sheet3.xml”/>

A style part is defined as

<Relationship Id=”rId5″ Type=”http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles” Target=”styles.xml”/>

DocProps folder: This folder contains 2 files, app.xml , core.xml which contains the information about workbook, owner, last modified etc.

Xl folder (Main Document Part): The target of the http://schemas.microsoft.com/office/2006/relationships/officeDocument relationship is the main part defining the document; the workbook part for Excel. All other relationships are based on the main document part.

This folder contains subfolders:

_rels folder: This folder contains a .rels file that defines the relationships at workbook level. Basically it contains the relationship for worksheet,SharedString etc.

.rels file: Contains relationships based on the workbook part level.

theme Folder: This folder defines a file which defines themes, theme1.xml. This file stores all of the elements of a theme that are applicable across Microsoft Office and the Spreadsheet settings. This file contains sections for theme color, font, and effect settings.

worksheets Folder: This folder contains the individual Sheet (part).The XML file for individual sheet contains a table like structure for defining data. Inside a worksheet the data can be split up into three distinct sections. The first section contains sheet properties. The second contains the data, using the required sheetData element. Right after sheetData various supporting features can be found such as sheet protection and filter information. I will explained later how the sheetData is defined.

<worksheet xmlns=”http://schemas.openxmlformats.org/spreadsheetml/2006/main” >
<sheetData>
<row>
<c>
<v>1234</v>
</c>
</row>
</sheetData>
</worksheet>

sharedStrings.xml: Shared strings optimize space requirements when the spreadsheet contains multiple instances of the same string. More detail is provided in the second part of this blog.

Styles.xml: This file defines the styles defined within the workbook. It can include cellStyles, Font Style etc.

workbook.xml: The first and foremost task of the workbook part is keeping track of the worksheets, global settings and other shared components of the workbook. The workbook part maintains a list of all the worksheets in the spreadsheet. This list is created to be able to name all the worksheets as well as providing sequencing information for the Open XML application. There are three pieces of data stored inside the workbook part about all the worksheets. Each sheet has a name attached for display in the consumer UI. There is an ID value used for sorting the sheets and finally a relationship ID to point the workbook to the part inside the package where the sheet is being stored. Other information in the workbook part concerns the views, calculations and versioning information as well as other options.

<workbook xmlns=http://schemas.openxmlformats.org/spreadsheetml/2006/main xmlns:r=”http:// schemas.openxmlformats.org /officeDocument/2006/relationships”> <sheets>
<sheet name=”Sheet1″ sheetId=”1″ r:id=”rId1″ /> </sheets> </workbook>

[Content_Types].xml: Provides a listing of the content types for the other parts that are contained in the package. Content types are defined as the types of parts that can be stored in a package:

For ex: This is how it is defined for workbook.xml and Shee1.xml

<Override PartName=”/xl/workbook.xml” ContentType=”application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml” />
  <Override PartName=”/xl/worksheets/sheet1.xml” ContentType=”application/vnd.openxmlformats-officedocument.spreadsheetml.worksheet+xml” />

Other Information: There are many more parts that you can put into a SpreadsheetML package. For example: insert a chart in the Excel file and then extract the content of it. Go to xl folder and you will see Chart folder, drawing folder has been created. In this way you can add objects using Excel UI and then analyze the xml structure to learn about this.

My next blog is going to talk about how data/formulas are defined within XML structure. To be specific worksheet1.xml.

Let me know what you think about this blog. In case of any questions on Open XML, leave a comment and I’ll respond to it quickly.

Happy reading!!!!!!!

Posted in Spreadsheet | 1 Comment »

 
%d bloggers like this: