Lets Learn

Opinion Matters

Archive for March, 2010

Basics of Spreadsheet – How to Clear a Column – Part 3

Posted by Ankush on March 11, 2010

Now we understand the basics of Spreadsheet, lets discuss about a simple scenario. Consider this, I have an Excel sheet and I would like to clear a column. To understand what’s required to accomplish this, lets follow:

 Create a sample worksheet.

Extract the content and view Sheet1.xml / SharedString.XML 

Sheetdata( Sheet1.xml) SharedString.XML
<sheetData>
<row r=”1″ spans=”1:2″>
  <c r=”A1″ t=”s”>
        <v>0</v>
  </c>
  <c r=”B1″>
       <v>9</v>
  </c>
</row><row r=”2″ spans=”1:2″>
 <c r=”A2″ t=”s”>
   <v>1</v>
 </c>
 <c r=”B2″>
  <v>10</v>
 </c>
</row>

<row r=”3″ spans=”1:2″>
  <c r=”A3″>
   <v>1</v>
  </c>
 <c r=”B3″>
   <v>11</v>
 </c>
</row>

<row r=”4″ spans=”1:2″>
  <c r=”A4″>
   <v>2</v>
  </c>
 <c r=”B4″>
   <v>12</v>
 </c>
</row>

<row r=”5″ spans=”1:2″>
  <c r=”A5″>
   <v>3</v>
  </c>
 <c r=”B5″>
   <v>13</v>
 </c>
</row>

<row r=”6″ spans=”1:2″>
  <c r=”A6″>
   <v>4</v>
  </c>
 <c r=”B6″>
   <v>14</v>
 </c>
</row>

<row r=”7″ spans=”1:2″>
  <c r=”A7″>
   <v>5</v>
  </c>
 <c r=”B7″>
   <v>15</v>
 </c>
</row>
</sheetData>

<?xml version=”1.0″ encoding=”UTF-8″ standalone=”yes”?>
<sst xmlns=”http://schemas.openxmlformats.org/spreadsheetml/2006/main&#8221; count=”2″ uniqueCount=”2″>
  <si>
    <t>John</t>
  </si>
  <si>
    <t>Tom</t>
  </si>
</sst>

 

As I have explained in my previous blog that strings are referenced based on index in Sheet1.xml and the actual values are present in the sharedString XML that’s why you don’t see any string stored in SheetData. Now if you look  at the SheetData, you will notice that it is defined row by row basis. So basically If I want to clear a column, I have to delete an XML node for each row element. Let’s just examine one row entry in detail:

 <row r=”1″ spans=”1:2″>
      <c r=”A1″ t=”s”>
        <v>0</v>
      </c>
      <c r=”B1″>
        <v>9</v>
      </c>
    </row> 

If you notice that to refer a column, Excel uses A1 and B1 and every row is defined in that way. Let’s say, you want to clear column A, then you need to go through the every row, find c node (basically cell) which refers to A column and delete that node. Basically

Previous XML After XML
    <row r=”1″ spans=”1:2″>
      <c r=”A1″ t=”s”>
        <v>0</v>
      </c>
      <c r=”B1″>
        <v>9</v>
      </c>
    </row>
    <row r=”2″ spans=”1:2″>
      <c r=”A2″ t=”s”>
        <v>1</v>
      </c>
      <c r=”B2″>
        <v>10</v>
      </c>
    </row>
    <row r=”1″ spans=”1:2″>
      <c r=”A1″>
        <v>9</v>
      </c>
    </row>
    <row r=”2″ spans=”1:2″>
      <c r=”A2″>
        <v>10</v>
      </c>
    </row>

 

So now you understand the concept, let’s have a look at the sample code which clears a column A 

// Clear a column from XL

    string docName=””,  worksheetName=”Sheet1″;
    SpreadsheetDocument document = SpreadsheetDocument.Open(@”Book1.xlsx”, true);
    IEnumerable<Sheet> sheets = document.WorkbookPart.Workbook.Descendants<Sheet>().Where(s => s.Name == worksheetName);
    if (sheets.Count() == 0)
    {
        // The specified worksheet does not exist.
        return;
    }
    WorksheetPart worksheetPart = (WorksheetPart)document.WorkbookPart.GetPartById(sheets.First().Id);
    Worksheet worksheet = worksheetPart.Worksheet;
    SheetData sheetData = worksheet.GetFirstChild<SheetData>();
    var cells = sheetData.Descendants<Cell>().Where(s => s.CellReference.Value.Contains(“A”));
    foreach (var item in cells)
    {
        item.Remove();
    }
    worksheet.Save();

But are we done yet????..Still there are some strings left in SharedString which were referenced by column A. For the completeness you can delete them as well but even if you don’t delete them, Excel will do it for you. When you will open the modified sheet in Excel and save it, it will readjust the data for you!!!!!!

Let me know wht you think about this Post!!!!!!

Posted in Open XML SDK, Spreadsheet | 6 Comments »

Basics Of SpreadSheet Part-2

Posted by Ankush on March 8, 2010

This is a continuation series to discuss about SpreadsheetML. Today I am going to discuss about how worksheet stores values, formulas and different technique it uses.

Worksheetx.xml
==================

The worksheet (basically Worksheetx.xml file) is the main component in SpreadsheetML. We already know that SheetData is the section which contain the actual values. To create new values for the worksheet you define rows inside the sheetData element. These rows contain cells ( c element), which contain values (v element). For ex:

<worksheet xmlns=”http://…/spreadsheetml/2006/main&#8221; > <sheetData> <row> <c> <v>42</v> </c> </row> </sheetData> </worksheet>

Here  is great optimization. You see only those rows/cells in which value is present. This can be very handy while reading the values. Basically if you use UI, you need to make sure to read only those cells which contains value. Using Open XML can be very easy in those scenarios where you want to read specific section on workbook.

 There are different techniques to store string value/Formulas in spreadsheet. Let’s have a look at the different techniques in detail:

In-Line String:
===================

This element allows for strings to be expressed directly in the cell definition instead of implementing the shared string table (explained in the same blog).To create a cell with text you use the ‘is’ element, which stands for inline-string.

<c r=”A1″><is><t>String</t> </is></c> 

Sparse table markup
=======================

As you might have noticed that up until now the row and cell position starts from the left top of the spreadsheet and runs outward, the first cell being cell A1. Using this model it takes quite a lot of markup if you only want to put a value in cell. E5. E is the fifth column, 5 the fifth row, that’s 25 cells to just get to cell E5. That is obviously not the most optimal thing to do with regard to the amount of markup and parsing to do. To solve this problem, a row and cell can include a position identifier. To for instance create a spreadsheet with a single row positioned at index 5, you add the r attribute to the row element as seen in markup sample 76. Cells use a similar model to identify their position. The r attribute is to identify the column and row index. Obviously the row index should be equal to the one defined at the row level.

<worksheet xmlns=”http://schemas.openxmlformats.org/spreadsheetml/2006/main”&gt;
<sheetData><row r=”5″><c r=”E5″><v>1234</v></c></row></sheetData></worksheet>

 Also note that you cannot specify row 5 before row 4 and column E before D. Doing so is considered an error in the spreadsheet.

Shared String
==================

If you create a spreadsheet with Microsoft Excel you will notice that cell text differs from the inline strings example you have just seen. Shared strings optimize space requirements when the spreadsheet contains multiple instances of the same string. If you stored these repeated strings using inline string markup, the same markup would appear over and over in the worksheet. There are several downsides to this approach. First of all the file will take more space on disk because of all the redundancy of content. Moreover, loading and saving will also take longer.To optimize the use of strings in a spreadsheet SpreadsheetML takes the approach of storing a single instance of the string in a table, called the shared strings table. The cells then reference the string by index instead of storing the value inline in the cell value.

The shared strings table appears as a separate part inside the package. Each workbook contains only one shared strings part whether the strings appear multiple times in one sheet or in multiple sheets. As I mention that there can only be one shared string table in the spreadsheet. Since each worksheet uses the same, it is referenced from the workbook instead of from each individual worksheet.  The shared string table is defined using the sst element. It uses two attributes to keep track of the content and usage of shared strings. The si element is used to add an item to the shared string. It stands for string-item and is the exact opposite of the inline string element is.

<sst xmlns=”http://schemas.openxmlformats.org/spreadsheetml/2006/main”&gt;
<si><t>Region</t></si><si><t>Sales Person</t></si><si><t>Sales Quota</t></si></sst>

The cells used the type quantifier t to identify them as inline string containers. Now they need to use shared strings instead. The type needs to be set to ‘shared string’, identified using the s value. The value of the cell is used to store the index into the shared string table. A good example is the header row of the sample spreadsheet. It uses the following markup to reference the string table.

<worksheet xmlns=”http://schemas.openxmlformats.org/spreadsheetml/2006/main”><sheetData&gt; <row r=”1″ spans=”1:4″><c r=”A1″ t=”s”><v>0</v></c><c r=”B1″ t=”s”><v>1</v></c><c r=”C1″ t=”s”><v>2</v></c><c r=”D1″ t=”s”><v>3</v></c></row></sheetData></worksheet>

 Formulas
===================

Formulas are stored inside each cell that uses a formula. Using the f element you define the formula text. Formulas can contain mathematical expressions that include a wide range of predefined functions.

<c><f>(A3*B3) * (1-C3)</f> <v>3168</v> </c>

 The v element, which previously contained the direct value, is now used to store the cached formula value based on the last time the formula was calculated. This allows the consumer to postpone calculation of the formula values when the spreadsheet is opened, which would result in longer waits when opening a worksheet. You do not need to specify the value, and if you omit it, it is the responsibility of the Open XML reader to compute the value based on the formula definition. The use of functions in a highly internationalized environment such as the document workspace brings interesting questions. When you run Microsoft Office Excel you are allowed to express functions in the native language. The ‘SUM()’ function in Spanish would become ‘SUMA()’. The important bit to consider is the way in which this is stored in Open XML. The function will be saved in its non-localized form. When a Spanish user would use the SUMA() function from within the consumer, the underlying markup would store the SUM() function. Only the user interface is localized, not the markup. This saves you a great deal of program code when working with functions, and let the developers benefit of working with well-known functions. Formulas cannot only appear inside a worksheet cell. Other elements such as table columns are also capable of storing formulas. The same model is used through-out. The formula is always stored as text.

Shared formulas
=============================

Similar to shared strings there is also an optimization for formulas. Like string values, formulas are often repeated over a range of cells, for instance in a calculated column of a table. They only differ in small details such as the row index the formula uses. For the consumer it is easier to load the first formula into memory and base the repeating formulas on that data than it is to load each formula individually. Unlike the optimization to strings using a separate shared-string part, the shared formula model does not rely on a separate part inside the package. It all takes place right within the worksheet data. A normal formula was defined by just adding the formula element f inside the cell. The formula text was stored within the opening and closing tags of the f element. For shared formulas the formula element has an extra attribute applied to identify it as being a shared formula, similar to identifying the cell type. The first cell using the formula stores three values. It stores the formula text like normal and a reference to the cell range using that formula type. The formula itself is identified using the si attribute. The next cell using that formula uses the same value for this attribute. It does not define the formula text, only the first cell has to store that information.

<row> <c> <v>1</v> </c> <c> <f t=”shared” ref=”C2:C4″ si=”0″>A2 + 1</f> <v>2</v> </c> </row> <row> <c> <v>2</v> </c> <c> <f t=”shared” si=”0″ /> <v>3</v> </c> </row>

Similarly you can create different elements in Excel using UI and then analyze the XML to see how it is defined and learn more….

Now we understand the basics of SpreadSheet, lets discuss about some basics operations in Excel using Open XML:

  • How to format cell as datetime
  • How to format number as number
  • How to add background color
  • How to add a formula
  • How to Clear a column

In my upcomming blogs, I will discuss about these tasks. If you have anything, which you would like me to have a createa blog upon, just let me know.

Here is previous part:  Basics of Spreadsheet – Part 1

Posted in Open XML SDK, Spreadsheet | 3 Comments »

Basics of Spreadsheet – Part 1

Posted by Ankush on March 4, 2010

I have been working with Open XML closely and thought of to write a blog which provides an Introduction to Open XML format. Office Open XML is a ZIP-based file format developed by Microsoft for representing spreadsheets, charts, presentations and word processing documents. By default, documents created using 2007 Microsoft Office products is based on new, XML-based file formats. Before getting into the discussion, let’s understand the basic terminology.

Package: The ZIP container that holds the components (parts) that comprise the document, as defined by the Open Packaging Conventions specification.

Part: Corresponds to one file in the package. For example, if a user right-clicks an Excel 2007 file and chooses to extract it, you will see files like workbook.xml file and several sheetn.xml files. Each of those files is a part in the package.

Relationships: The method used to specify how the collection of parts are related. This method specifies the connection between a source part and a target resource. Relationships are stored within XML parts (for example, /_rels/.rels) in the document package.

XML: Extensible Markup Language (XML) is a simple, flexible text format designed for electronic publishing and the exchange of a wide variety of data on the Internet and elsewhere.

ZIP: An industry-standard compression archives format used to store and transport files between computers, through e-mail, or over the Internet.

Now we understand this, lets discuss about SpreadsheetML. Let’s say you have created an Excel workbook using excel 2007 which contains some values in cell. You rename the Excel file to .zip and extract the content of it, this is the structure you will see:

_rels folder: This folder contains a .rels file that defines the root relationships within the package.

.rels file: Contains relationships based on the start part (the virtual start part). Relationships are defined with the following format:

 <Relationship Id=”someID” Type=”relationshipType” Target=”targetPart”/>

where Id is a unique string in the .rels file.

Type: The type of the relationship that distinguishes relationships from one another and provides a hint as to the relationship’s purpose. It points to the schema that defines Office XML Formats types.

Target: Points to the folder and file that contain the target of the relationship (another part).

For example: A sheet part is defined as

 <Relationship Id=”rId3″ Type=”http://schemas.openxmlformats.org/officeDocument/2006/relationships/worksheet” Target=”worksheets/sheet3.xml”/>

A style part is defined as

<Relationship Id=”rId5″ Type=”http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles” Target=”styles.xml”/>

DocProps folder: This folder contains 2 files, app.xml , core.xml which contains the information about workbook, owner, last modified etc.

Xl folder (Main Document Part): The target of the http://schemas.microsoft.com/office/2006/relationships/officeDocument relationship is the main part defining the document; the workbook part for Excel. All other relationships are based on the main document part.

This folder contains subfolders:

_rels folder: This folder contains a .rels file that defines the relationships at workbook level. Basically it contains the relationship for worksheet,SharedString etc.

.rels file: Contains relationships based on the workbook part level.

theme Folder: This folder defines a file which defines themes, theme1.xml. This file stores all of the elements of a theme that are applicable across Microsoft Office and the Spreadsheet settings. This file contains sections for theme color, font, and effect settings.

worksheets Folder: This folder contains the individual Sheet (part).The XML file for individual sheet contains a table like structure for defining data. Inside a worksheet the data can be split up into three distinct sections. The first section contains sheet properties. The second contains the data, using the required sheetData element. Right after sheetData various supporting features can be found such as sheet protection and filter information. I will explained later how the sheetData is defined.

<worksheet xmlns=”http://schemas.openxmlformats.org/spreadsheetml/2006/main” >
<sheetData>
<row>
<c>
<v>1234</v>
</c>
</row>
</sheetData>
</worksheet>

sharedStrings.xml: Shared strings optimize space requirements when the spreadsheet contains multiple instances of the same string. More detail is provided in the second part of this blog.

Styles.xml: This file defines the styles defined within the workbook. It can include cellStyles, Font Style etc.

workbook.xml: The first and foremost task of the workbook part is keeping track of the worksheets, global settings and other shared components of the workbook. The workbook part maintains a list of all the worksheets in the spreadsheet. This list is created to be able to name all the worksheets as well as providing sequencing information for the Open XML application. There are three pieces of data stored inside the workbook part about all the worksheets. Each sheet has a name attached for display in the consumer UI. There is an ID value used for sorting the sheets and finally a relationship ID to point the workbook to the part inside the package where the sheet is being stored. Other information in the workbook part concerns the views, calculations and versioning information as well as other options.

<workbook xmlns=http://schemas.openxmlformats.org/spreadsheetml/2006/main xmlns:r=”http:// schemas.openxmlformats.org /officeDocument/2006/relationships”> <sheets>
<sheet name=”Sheet1″ sheetId=”1″ r:id=”rId1″ /> </sheets> </workbook>

[Content_Types].xml: Provides a listing of the content types for the other parts that are contained in the package. Content types are defined as the types of parts that can be stored in a package:

For ex: This is how it is defined for workbook.xml and Shee1.xml

<Override PartName=”/xl/workbook.xml” ContentType=”application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml” />
  <Override PartName=”/xl/worksheets/sheet1.xml” ContentType=”application/vnd.openxmlformats-officedocument.spreadsheetml.worksheet+xml” />

Other Information: There are many more parts that you can put into a SpreadsheetML package. For example: insert a chart in the Excel file and then extract the content of it. Go to xl folder and you will see Chart folder, drawing folder has been created. In this way you can add objects using Excel UI and then analyze the xml structure to learn about this.

My next blog is going to talk about how data/formulas are defined within XML structure. To be specific worksheet1.xml.

Let me know what you think about this blog. In case of any questions on Open XML, leave a comment and I’ll respond to it quickly.

Happy reading!!!!!!!

Posted in Spreadsheet | 1 Comment »

 
%d bloggers like this: